Abstract
Human-object interaction (HOI) detection is a core problem in human-centric scene understanding: it aims to infer <human, verb, object> triplets that describe how humans interact with objects. Previous works mainly determine the interaction of each human-object pair by joint inference over multiple features. In this paper, we design a more discriminative representation of the human-object pair and a more effective HOI detection model. First, we use human poses as an attention mechanism to strengthen features, which is a novel way to use human poses in HOI detection. Second, to represent objects more effectively, we encode each object as a word vector, and a graph convolution network captures the relation features of humans and objects from the object word vectors and human appearance features. These relation features are also strengthened by the human pose attention mechanism. Our model yields favorable results compared with state-of-the-art HOI detection algorithms on two large-scale benchmark datasets, V-COCO and HICO-DET.
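To make the two ideas in the abstract concrete, the following is a minimal sketch only, not the network described in the paper. It assumes that pose attention can be approximated as a softmax over per-location pose scores that reweights appearance features, and that the relation step is a single standard graph-convolution layer over two nodes (a human appearance feature and an object word vector). All names (`pose_attention`, `gcn_layer`), the toy values, and the identity weight matrix are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pose_attention(features, pose_scores):
    """Pool per-location features with weights derived from pose scores.

    features: list of N feature vectors, one per spatial location.
    pose_scores: list of N scalars, e.g. keypoint heatmap responses
                 (an assumption made for this sketch).
    """
    weights = softmax(pose_scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(node_feats, adjacency, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops, then apply symmetric degree normalisation.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, node_feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

# Toy inputs (hypothetical): three spatial locations with 2-D features.
human_locs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pose_scores = [2.0, 0.0, 0.0]          # the first location lies on a body joint
human_feat = pose_attention(human_locs, pose_scores)

object_vec = [0.5, -0.5]               # stand-in for an object word embedding
nodes = [human_feat, object_vec]
adjacency = [[0.0, 1.0], [1.0, 0.0]]   # single human-object edge
weight = [[1.0, 0.0], [0.0, 1.0]]      # identity in place of learned weights

relation = gcn_layer(nodes, adjacency, weight)
```

In this toy setting the location with the highest pose score dominates the pooled human feature, and the graph-convolution step mixes that feature with the object vector so that both nodes carry human-object relation information. A real model would learn the weight matrix and use high-dimensional features and embeddings.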
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable and insightful comments on an earlier version of this manuscript.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was supported by the Natural Science Foundation of China [Nos. 61871196, 62001176, 61902330 and 61673186]; the National Key Research and Development Program of China [No. 2019YFC1604700]; the Natural Science Foundation of Fujian Province of China [Nos. 2019J01082 and 2020J01085]; and the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University [ZQN-YX601].
Cite this article
Deng, WM., Zhang, HB., Lei, Q. et al. Pose attention and object semantic representation-based human-object interaction detection network. Multimed Tools Appl 81, 39453–39470 (2022). https://doi.org/10.1007/s11042-022-13146-x
DOI: https://doi.org/10.1007/s11042-022-13146-x