
Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition

International Journal of Computer Vision

Abstract

Social group activity recognition is a challenging extension of group activity recognition in which each social group must be recognized together with its activity and its members. Existing methods tackle this task by leveraging region features of individuals, following existing group activity recognition methods. However, the effectiveness of region features is sensitive to person localization and to the variable semantics of individual actions. To overcome these issues, we propose leveraging attention modules in transformers to generate social group features. In this method, multiple embeddings are used to aggregate features for a social group, and each embedding is assigned to a group member without duplication. Because of this non-duplicated assignment, the number of embeddings must be large to avoid missing group members, which in turn renders attention in transformers ineffective. To find optimal attention designs for a large number of embeddings, we explore several design choices for the queries used in feature aggregation and for the self-attention modules in transformer decoders. Extensive experimental results show that the proposed method achieves state-of-the-art performance and verify that the proposed attention designs are highly effective for social group activity recognition.
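
The non-duplicated assignment described in the abstract can be illustrated with a toy example. The abstract does not state how the assignment is computed, so the sketch below rests on assumptions: it treats the assignment as a minimum-cost one-to-one matching, the cost values are invented, and the function name `assign_without_duplication` is hypothetical. It is not the paper's implementation. It only shows why the number of embeddings must be at least the number of group members.

```python
from itertools import permutations


def assign_without_duplication(cost):
    """Return a minimum-cost one-to-one assignment of embeddings to members.

    cost[i][j] is a hypothetical matching cost between embedding i and
    group member j (an assumption for illustration only). Each member
    receives a distinct embedding, so there must be at least as many
    embeddings as members.
    """
    n_embeddings = len(cost)
    n_members = len(cost[0]) if cost else 0
    if n_embeddings < n_members:
        raise ValueError("too few embeddings: some group members would be missed")

    best_total, best_assignment = float("inf"), None
    # Brute force over ordered selections of distinct embeddings. This is only
    # practical for tiny inputs; it stands in for a polynomial-time solver.
    for chosen in permutations(range(n_embeddings), n_members):
        total = sum(cost[e][m] for m, e in enumerate(chosen))
        if total < best_total:
            best_total, best_assignment = total, chosen
    # best_assignment[m] is the index of the embedding assigned to member m.
    return list(best_assignment), best_total


# Toy input: 4 embeddings (rows) and 3 group members (columns).
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
    [0.5, 0.5, 0.5],
]
assignment, total = assign_without_duplication(cost)
print(assignment)  # index of the embedding chosen for each member
```

In this toy input one embedding stays unassigned. With fewer embeddings than members the function raises an error, which mirrors the abstract's point that too few embeddings would leave group members unmatched.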

Data Availability

All the datasets used in this paper are publicly available.

Acknowledgements

This work used computational resources of the AI Bridging Cloud Infrastructure (ABCI) provided by the National Institute of Advanced Industrial Science and Technology (AIST).

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Masato Tamura.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Yasushi Yagi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tamura, M. Design and Analysis of Efficient Attention in Transformers for Social Group Activity Recognition. Int J Comput Vis 132, 4269–4288 (2024). https://doi.org/10.1007/s11263-024-02082-y

  • DOI: https://doi.org/10.1007/s11263-024-02082-y
