Abstract
Social group activity recognition is a challenging task that extends group activity recognition: social groups must be recognized together with their activities and their members. Existing methods tackle this task by leveraging the region features of individuals, following existing group activity recognition methods. However, the effectiveness of region features is susceptible to person localization errors and the variable semantics of individual actions. To overcome these issues, we propose leveraging the attention modules in transformers to generate social group features. In this method, multiple embeddings are used to aggregate the features of a social group, and each embedding is assigned to a group member without duplication. Because of this non-duplicated assignment, the number of embeddings must be large to avoid missing group members, which renders the attention in transformers ineffective. To find attention designs that remain effective with a large number of embeddings, we explore several design choices for the queries used in feature aggregation and for the self-attention modules in transformer decoders. Extensive experimental results show that the proposed method achieves state-of-the-art performance and verify that the proposed attention designs are highly effective for social group activity recognition.
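The core idea of aggregating social group features with multiple query embeddings can be sketched as a single cross-attention step, in which each learned embedding attends over the spatio-temporal feature map and produces one aggregated feature. This is only a minimal illustrative sketch; the function name, shapes, and dimensions below are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_group_features(queries, features):
    """Cross-attention aggregation sketch.

    queries:  (N, d) learned group-token embeddings
              (N must be large enough to cover all group members)
    features: (T, d) flattened spatio-temporal features from the backbone
    returns:  (N, d) one aggregated feature per embedding
    """
    d = queries.shape[-1]
    # Scaled dot-product attention weights over the feature map.
    attn = softmax(queries @ features.T / np.sqrt(d), axis=-1)  # (N, T)
    # Each embedding aggregates a weighted sum of features.
    return attn @ features

rng = np.random.default_rng(0)
N, T, d = 8, 32, 16
out = aggregate_group_features(rng.standard_normal((N, d)),
                               rng.standard_normal((T, d)))
print(out.shape)  # (8, 16)
```

In a full transformer decoder this step would be interleaved with self-attention among the embeddings and feed-forward layers; the abstract's design exploration concerns exactly how those queries and self-attention modules behave when N grows large.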
Data Availability
All the datasets used in this paper are publicly available.
Acknowledgements
This work used the computational resources of the AI Bridging Cloud Infrastructure (ABCI) provided by the National Institute of Advanced Industrial Science and Technology (AIST).
Funding
Not applicable.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Yasushi Yagi.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tamura, M. Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition. Int J Comput Vis 132, 4269–4288 (2024). https://doi.org/10.1007/s11263-024-02082-y