Psychology-Guided Environment Aware Network for Discovering Social Interaction Groups from Videos

Published: 13 June 2024

Abstract

Social interaction is a common phenomenon in human societies. Unlike group discovery based on the similarity of individuals’ actions, social interaction discovery focuses on the mutual influence between people. Although people can easily judge whether social interactions are present in a real-world scene, this is difficult for an intelligent system. The initiation and conclusion of social interactions are strongly influenced by an individual’s social cognition and the surrounding environment, both of which are closely related to psychology. Converting the psychological factors that affect social interactions into quantifiable visual representations, and building a model of interaction relationships from them, therefore poses a significant challenge. To this end, we propose a Psychology-Guided Environment Aware Network (PEAN) that models social interaction among people in videos using supervised learning. Specifically, we divide the surrounding environment into scene-aware and human-aware visual descriptions. For the scene-aware visual clue, we use 3D features as global visual representations. For the human-aware visual clue, we use instance-based location and behaviour-related visual representations to map the human-centred interaction elements of social psychology: distance, openness, and orientation. In addition, we design an environment aware mechanism that integrates features from the visual clues, together with a Transformer that explores the relations between individuals and constructs pairwise interaction strength features. An interaction discovery module processes these interaction strength features to obtain the interaction intensity matrix, which reflects the mutual nature of interaction. To optimize the whole framework, we propose an interaction constrained loss function composed of an interaction critical loss function and a smooth Fβ loss function; it improves the distinctiveness of the interaction matrix and alleviates the class imbalance caused by pairwise interaction sparsity. Given the diversity of real-world interactions, we collect a new dataset, the Social Basketball Activity Dataset (Social-BAD), which covers complex social interactions. Our method achieves the best performance on Social-CAD, Social-BAD, and their combination, the Video Social Interaction Dataset (VSID).
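The abstract does not give the form of the smooth Fβ loss, so the following is only a minimal sketch of a generic soft Fβ surrogate applied to a pairwise interaction matrix, not the authors' formulation. The function names, the sigmoid activation, and the choice to skip the diagonal (self-pairs) are all assumptions made for illustration. Hard true positive, false positive, and false negative counts are replaced by sums of probabilities, which keeps the score differentiable and makes it far less sensitive to the many non-interacting pairs than a per-entry loss.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_fbeta_loss(logits, targets, beta=1.0, eps=1e-8):
    """Return 1 - soft F-beta over the off-diagonal entries of an N x N matrix.

    logits  : N x N nested list of raw pairwise interaction scores (hypothetical input)
    targets : N x N nested list of 0/1 ground-truth interaction labels
    beta    : weight of recall relative to precision
    """
    tp = fp = fn = 0.0
    n = len(logits)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: a person does not interact with themselves
            p = sigmoid(logits[i][j])
            y = float(targets[i][j])
            tp += p * y                  # soft true positives
            fp += p * (1.0 - y)          # soft false positives
            fn += (1.0 - p) * y          # soft false negatives
    b2 = beta * beta
    fbeta = (1.0 + b2) * tp / ((1.0 + b2) * tp + b2 * fn + fp + eps)
    return 1.0 - fbeta


# Three people; only persons 0 and 1 interact, so positives are sparse.
targets = [[0, 1, 0],
           [1, 0, 0],
           [0, 0, 0]]
good_logits = [[10.0 if t else -10.0 for t in row] for row in targets]
bad_logits = [[-v for v in row] for row in good_logits]

loss_good = soft_fbeta_loss(good_logits, targets)
loss_bad = soft_fbeta_loss(bad_logits, targets)
```

In this toy case the confident, correct prediction yields a loss close to 0 and the inverted prediction a loss close to 1. Setting `beta` above 1 would weight recall of the rare interacting pairs more heavily.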

Supplementary Material

3657295.supp.pdf (supplementary material)

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 8
    August 2024
    726 pages
    EISSN: 1551-6865
    DOI: 10.1145/3618074
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 13 June 2024
    Online AM: 09 April 2024
    Accepted: 03 April 2024
    Revised: 27 February 2024
    Received: 12 July 2023
    Published in TOMM Volume 20, Issue 8

    Author Tags

    1. Social interaction
    2. psychological elements
    3. environment aware
    4. interaction constrained loss function

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Science and Technology Commission of Shanghai Municipality
