InteractNet: Social Interaction Recognition for Semantic-rich Videos

Published: 12 June 2024

Abstract

    The overwhelming surge of online video platforms has created an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos reflect more complicated semantics, such as character relationships or emotions, and thus better support downstream applications, e.g., story summarization and fine-grained clip retrieval. However, given the longer duration of social interactions, their severe mutual overlap, and the multiple characters, dynamic scenes, and multi-modal cues involved, traditional solutions for short-term action recognition will likely fail at this task. To address these challenges, in this article we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues, and then learns node representations as short-term interaction patterns via an adapted GCN module. Along this line, global interaction representations are accumulated through a sub-clip identification module, which effectively filters out irrelevant information and resolves temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baselines.
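    To make the described pipeline concrete, the following is a minimal PyTorch sketch of the architecture as summarized in the abstract. It is not the authors' implementation: the class names, feature dimensions, the gating form of the sub-clip identification module, and the mean-pooling fusion are all illustrative assumptions.

    ```python
    # A minimal sketch of the InteractNet pipeline, assuming a fixed cast of
    # N characters per sampled frame; all names, dimensions, and the
    # gated-pooling fusion are assumptions, not the paper's implementation.
    import torch
    import torch.nn as nn


    class GCNLayer(nn.Module):
        """One graph-convolution step, H' = ReLU(A_hat H W), in the style of
        Kipf & Welling; A_hat is a normalized adjacency with self-loops."""

        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.linear = nn.Linear(in_dim, out_dim)

        def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            return torch.relu(self.linear(adj @ h))


    class InteractNetSketch(nn.Module):
        def __init__(self, node_dim=512, hidden_dim=256, num_classes=8):
            super().__init__()
            # Short-term patterns: adapted GCN over each per-frame semantic graph.
            self.frame_gcn = GCNLayer(node_dim, hidden_dim)
            # Sub-clip identification, modeled here as a per-node relevance gate.
            self.subclip_gate = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
            # Global-level character-pair graph and interaction classifier.
            self.pair_gcn = GCNLayer(2 * hidden_dim, hidden_dim)
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, frame_graphs, pair_index, pair_adj):
            # frame_graphs: list of (node_feats (N, D), adj (N, N)), one per
            # sampled frame; node features fuse multi-modal cues (visual,
            # dialogue, scene). pair_index: (P, 2) character indices of each
            # co-occurring pair. pair_adj: (P, P) adjacency linking
            # simultaneous interactions.
            gated = []
            for feats, adj in frame_graphs:
                h = self.frame_gcn(feats, adj)            # short-term patterns
                gate = self.subclip_gate(h)               # relevance in [0, 1]
                gated.append(gate * h)                    # filter irrelevant frames
            chars = torch.stack(gated).mean(dim=0)        # (N, H) global per character

            pair_feats = torch.cat([chars[pair_index[:, 0]],
                                    chars[pair_index[:, 1]]], dim=-1)
            pair_h = self.pair_gcn(pair_feats, pair_adj)  # link simultaneous pairs
            return self.classifier(pair_h)                # per-pair interaction logits


    # Toy usage: 3 characters, 4 sampled frames, 2 candidate character pairs.
    N, D, T = 3, 512, 4
    frames = [(torch.randn(N, D), torch.eye(N)) for _ in range(T)]
    logits = InteractNetSketch()(frames, torch.tensor([[0, 1], [1, 2]]), torch.eye(2))
    print(logits.shape)  # torch.Size([2, 8])
    ```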

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 8
    August 2024
    698 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3618074

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 12 June 2024
    Online AM: 03 May 2024
    Accepted: 24 April 2024
    Revised: 28 January 2024
    Received: 31 July 2023
    Published in TOMM Volume 20, Issue 8

    Author Tags

    1. Multi-modal analysis
    2. Video-and-language understanding
    3. Graph convolutional network

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
