DOI: 10.1145/3343031.3351024

Visual Relationship Detection with Relative Location Mining

Published: 15 October 2019

Abstract

Visual relationship detection, a challenging task that aims to find and distinguish the interactions between object pairs in an image, has received much attention recently. In this work, we propose a novel visual relationship detection framework that deeply mines and utilizes the relative location of each object pair in every stage of the procedure. In both stages, object-pair proposal and predicate recognition, the relative location information of each object pair is abstracted and encoded as an auxiliary feature to improve the distinguishing capability of that stage. Moreover, a Gated Graph Neural Network (GGNN) is introduced to mine and measure the relevance of predicates using relative location. With the location-based GGNN, non-exclusive predicates with similar spatial positions can first be clustered and then smoothed so that they receive close classification scores, which further increases top-n recall. Experiments on two widely used datasets, VRD and VG, show that, by deeply mining and exploiting relative location information, our proposed model significantly outperforms the current state of the art.
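The abstract does not spell out how the relative location of an object pair is encoded. As an illustration only, the sketch below uses a box-geometry encoding of the kind often used for this purpose (normalized center offsets, log size ratios, and IoU). The function name `relative_location_feature` and the choice of five components are assumptions for this example, not the paper's definition.

```python
import math


def relative_location_feature(subj_box, obj_box):
    """Encode where obj_box lies relative to subj_box.

    Hypothetical encoding, not taken from the paper.
    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    Returns [dx, dy, log_w_ratio, log_h_ratio, iou].
    """
    sx1, sy1, sx2, sy2 = subj_box
    ox1, oy1, ox2, oy2 = obj_box
    sw, sh = sx2 - sx1, sy2 - sy1  # subject width and height
    ow, oh = ox2 - ox1, oy2 - oy1  # object width and height

    # Offset between box centers, normalized by the subject's size
    dx = ((ox1 + ox2) / 2 - (sx1 + sx2) / 2) / sw
    dy = ((oy1 + oy2) / 2 - (sy1 + sy2) / 2) / sh

    # Log ratios of object size to subject size
    log_w = math.log(ow / sw)
    log_h = math.log(oh / sh)

    # Intersection over union of the two boxes
    inter_w = max(0.0, min(sx2, ox2) - max(sx1, ox1))
    inter_h = max(0.0, min(sy2, oy2) - max(sy1, oy1))
    inter = inter_w * inter_h
    union = sw * sh + ow * oh - inter
    iou = inter / union

    return [dx, dy, log_w, log_h, iou]


# Two equally sized boxes, the object shifted right by half a box width
feature = relative_location_feature((0, 0, 10, 10), (5, 0, 15, 10))
```

A vector like this could be concatenated with appearance features before pair proposal or predicate classification. How the paper actually combines the two is not stated in the abstract.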

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. graph neural network
  2. relative location
  3. visual relationship

Qualifiers

  • Research-article

Conference

MM '19
Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;
Overall Acceptance Rate 995 of 4,171 submissions, 24%

Cited By

  • (2023) Pixel-level feature enhancement and weighted fusion for visual relationship detection. Second International Conference on Electronic Information Technology (EIT 2023). DOI: 10.1117/12.2685674 (89). Online publication date: 15-Aug-2023
  • (2023) DBiased-P: Dual-Biased Predicate Predictor for Unbiased Scene Graph Generation. IEEE Transactions on Multimedia. DOI: 10.1109/TMM.2022.3190135, 25 (5319-5329). Online publication date: 1-Jan-2023
  • (2022) Divide-and-Conquer Predictor for Unbiased Scene Graph Generation. IEEE Transactions on Circuits and Systems for Video Technology. DOI: 10.1109/TCSVT.2022.3193857, 32:12 (8611-8622). Online publication date: Dec-2022
  • (2022) Understanding Atomic Hand-Object Interaction With Human Intention. IEEE Transactions on Circuits and Systems for Video Technology. DOI: 10.1109/TCSVT.2021.3058688, 32:1 (275-285). Online publication date: Jan-2022
  • (2022) Multi-Focus Guided Semantic Aggregation for Video Object Detection. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: 10.1109/ICASSP43922.2022.9746283 (4723-4727). Online publication date: 23-May-2022
  • (2022) Fixed-Size Objects Encoding for Visual Relationship Detection. Neural Processing Letters. DOI: 10.1007/s11063-022-10766-0, 54:4 (3249-3261). Online publication date: 23-Feb-2022
  • (2021) Learning intra-inter semantic aggregation for video object detection. Proceedings of the 2nd ACM International Conference on Multimedia in Asia. DOI: 10.1145/3444685.3446273 (1-7). Online publication date: 7-Mar-2021
  • (2021) Detecting Human-Object Interaction with Mixed Supervision. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). DOI: 10.1109/WACV48630.2021.00127 (1227-1236). Online publication date: Jan-2021
  • (2021) Improving Visual Relationship Detection With Two-Stage Correlation Exploitation. IEEE Transactions on Circuits and Systems for Video Technology. DOI: 10.1109/TCSVT.2020.3032650, 31:7 (2751-2763). Online publication date: Jul-2021
  • (2021) Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). DOI: 10.1109/CVPR46437.2021.00225 (2215-2224). Online publication date: Jun-2021
