research-article

Visual Relationship Detection with Relative Location Mining

Authors:

Chongyang Zhang,

Chuanping Hu

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

Pages 30 - 38

https://doi.org/10.1145/3343031.3351024

Published: 15 October 2019

Abstract

Visual relationship detection, the challenging task of finding and distinguishing the interactions between object pairs in an image, has received much attention recently. In this work, we propose a novel visual relationship detection framework that mines and uses the relative location of each object pair in every stage of the procedure. In both stages, object-pair proposal and predicate recognition, the relative location information of each object pair is abstracted and encoded as an auxiliary feature to improve the discriminative capability of the respective stage. Moreover, a Gated Graph Neural Network (GGNN) is introduced to mine and measure the relevance among predicates using relative location. With the location-based GGNN, non-exclusive predicates with similar spatial positions are first clustered and then smoothed so that they receive close classification scores, which further increases top-n recall. Experiments on two widely used datasets, VRD and VG, show that, by mining and exploiting relative location information, our proposed model significantly outperforms the current state of the art.
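The abstract does not spell out how the relative location of an object pair is encoded. As a rough illustration only, the sketch below uses a box-delta encoding that is common in detection work (normalized center offsets, log size ratios, and IoU). The names `Box` and `relative_location`, and the choice of these five features, are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned box given by its corner coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def relative_location(subj: Box, obj: Box) -> List[float]:
    """Encode where `subj` lies relative to `obj` as a 5-d feature vector.

    Assumed encoding (not the paper's): center offsets normalized by the
    object's size, log width/height ratios, and the IoU of the two boxes.
    Both boxes are assumed to have strictly positive width and height.
    """
    sw, sh = subj.x2 - subj.x1, subj.y2 - subj.y1
    ow, oh = obj.x2 - obj.x1, obj.y2 - obj.y1

    # Box centers.
    scx, scy = subj.x1 + sw / 2, subj.y1 + sh / 2
    ocx, ocy = obj.x1 + ow / 2, obj.y1 + oh / 2

    # Scale-invariant offsets and size ratios.
    dx = (scx - ocx) / ow
    dy = (scy - ocy) / oh
    dw = math.log(sw / ow)
    dh = math.log(sh / oh)

    # Intersection over union.
    iw = max(0.0, min(subj.x2, obj.x2) - max(subj.x1, obj.x1))
    ih = max(0.0, min(subj.y2, obj.y2) - max(subj.y1, obj.y1))
    inter = iw * ih
    union = sw * sh + ow * oh - inter

    return [dx, dy, dw, dh, inter / union]


# A subject box directly to the left of an equally sized object box.
left_of = relative_location(Box(0, 0, 10, 10), Box(10, 0, 20, 10))
```

A vector like this could be concatenated with appearance features before pair proposal or predicate classification. Whether the paper does exactly this cannot be determined from the abstract alone.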

Index Terms

Visual Relationship Detection with Relative Location Mining
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object detection
      2. Computer vision tasks
    2. Knowledge representation and reasoning
      1. Semantic networks

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

October 2019

2794 pages

ISBN:9781450368896

DOI:10.1145/3343031

General Chairs:
Laurent Amsaleg (CNRS-IRISA, France)
Benoit Huet (EURECOM, France)
Martha Larson (Radboud University and TU Delft, Netherlands)

Program Chairs:
Guillaume Gravier (CNRS-IRISA, France)
Hayley Hung (Delft University of Technology, Netherlands)
Chong-Wah Ngo (City University of Hong Kong, Hong Kong)
Wei Tsang Ooi (National University of Singapore, Singapore)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Natural Science Foundation of China
National Key Research and Development Program of China
Science and Technology Commission of Shanghai Municipality

Conference

MM '19

Sponsor:

SIGMM

MM '19: The 27th ACM International Conference on Multimedia

October 21 - 25, 2019

Nice, France

Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;

Overall Acceptance Rate 995 of 4,171 submissions, 24%

Bibliometrics

Total citations: 11
Total downloads: 326
Downloads (last 12 months): 8
Downloads (last 6 weeks): 1

Reflects downloads up to 04 Oct 2024
Cited By

Li J (2023). Pixel-level feature enhancement and weighted fusion for visual relationship detection. Second International Conference on Electronic Information Technology (EIT 2023), (89). Online publication date: 15-Aug-2023.
https://doi.org/10.1117/12.2685674
Han X, Song X, Dong X, Wei Y, Liu M, Nie L (2023). DBiased-P: Dual-Biased Predicate Predictor for Unbiased Scene Graph Generation. IEEE Transactions on Multimedia, 25 (5319-5329). Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1109/TMM.2022.3190135
Han X, Dong X, Song X, Gan T, Zhan Y, Yan Y, Nie L (2022). Divide-and-Conquer Predictor for Unbiased Scene Graph Generation. IEEE Transactions on Circuits and Systems for Video Technology, 32:12 (8611-8622). Online publication date: Dec-2022.
https://doi.org/10.1109/TCSVT.2022.3193857
Fan H, Zhuo T, Yu X, Yang Y, Kankanhalli M (2022). Understanding Atomic Hand-Object Interaction With Human Intention. IEEE Transactions on Circuits and Systems for Video Technology, 32:1 (275-285). Online publication date: Jan-2022.
https://doi.org/10.1109/TCSVT.2021.3058688
Ye H, Wang G, Lu Y, Yan Y, Wang H (2022). Multi-Focus Guided Semantic Aggregation for Video Object Detection. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (4723-4727). Online publication date: 23-May-2022.
https://doi.org/10.1109/ICASSP43922.2022.9746283
Pan H, Niu X, Shen S, Chen Y, Qiao P, Huang Z, Li D (2022). Fixed-Size Objects Encoding for Visual Relationship Detection. Neural Processing Letters, 54:4 (3249-3261). Online publication date: 23-Feb-2022.
https://doi.org/10.1007/s11063-022-10766-0
Liang J, Chen H, Du K, Yan Y, Wang H, Chua T, Wang J, Tian Q, Gurrin C, Jia J, Zhang H, Sun Q (2021). Learning intra-inter semantic aggregation for video object detection. Proceedings of the 2nd ACM International Conference on Multimedia in Asia, (1-7). Online publication date: 7-Mar-2021.
https://dl.acm.org/doi/10.1145/3444685.3446273
Kumaraswamy S, Shi M, Kijak E (2021). Detecting Human-Object Interaction with Mixed Supervision. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), (1227-1236). Online publication date: Jan-2021.
https://doi.org/10.1109/WACV48630.2021.00127
Zhou H, Zhang C, Zhao M, Luo Y, Hu C (2021). Improving Visual Relationship Detection With Two-Stage Correlation Exploitation. IEEE Transactions on Circuits and Systems for Video Technology, 31:7 (2751-2763). Online publication date: Jul-2021.
https://doi.org/10.1109/TCSVT.2020.3032650
Zeng Y, Cao D, Wei X, Liu M, Zhao Z, Qin Z (2021). Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2215-2224). Online publication date: Jun-2021.
https://doi.org/10.1109/CVPR46437.2021.00225
