DOI: 10.1145/3664647.3688975

Advancing Micro-Action Recognition with Multi-Auxiliary Heads and Hybrid Loss Optimization

Published: 28 October 2024

Abstract

Video action recognition has been an active research direction in computer vision, with most existing methods focusing on coarse-grained macro-action recognition; fine-grained action recognition remains challenging. Micro-actions, characterized by fine granularity, low intensity, and short duration, are crucial for applications in emotion recognition and psychological assessment. In this paper, we build on popular video action recognition frameworks as foundation models, introducing multi-auxiliary heads and hybrid loss optimization to advance micro-action recognition. Specifically, the Frame-Level Prediction and Coarse-Grained Body-Action auxiliary heads work collaboratively with the Fine-Grained Micro-Action primary head, enhancing the model's ability to perceive fine-grained actions and capture keyframes. Incorporating F1 loss, ArcFace loss, and a weighted multi-task loss improves training stability, convergence speed, and performance. Additionally, integrating the optical flow modality enriches model diversity, and ensemble learning is applied across all foundation models. Our method achieves a 75.37% F1-mean on the MA-52 dataset, ranking 1st in the Micro-Action Analysis Grand Challenge held in conjunction with ACM MM'24. The code is available at https://github.com/qklee-lz/ACMMM2024-MAC.
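The following is a minimal PyTorch sketch of how the multi-auxiliary-head design and the weighted multi-task loss described above could be wired together. The backbone interface, head dimensions (e.g., 52 micro-action and 7 body-action classes, as in MA-52), loss weights, and the soft-F1 formulation are illustrative assumptions rather than the authors' released configuration, and the ArcFace-style angular-margin term on the primary head is omitted for brevity; see the linked GitHub repository for the official implementation.

```python
# Minimal sketch, assuming a video backbone that returns per-frame features of
# shape (B, T, feat_dim). Head sizes and loss weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadMicroActionModel(nn.Module):
    """Shared video backbone with one primary and two auxiliary heads."""

    def __init__(self, backbone, feat_dim=768, num_micro=52, num_body=7):
        super().__init__()
        self.backbone = backbone                           # any video encoder -> (B, T, feat_dim)
        self.frame_head = nn.Linear(feat_dim, num_micro)   # frame-level prediction (auxiliary)
        self.body_head = nn.Linear(feat_dim, num_body)     # coarse-grained body action (auxiliary)
        self.micro_head = nn.Linear(feat_dim, num_micro)   # fine-grained micro-action (primary)

    def forward(self, video):
        feats = self.backbone(video)                       # (B, T, feat_dim)
        clip_feat = feats.mean(dim=1)                      # temporal average pooling
        return {
            "frame_logits": self.frame_head(feats),        # (B, T, num_micro)
            "body_logits": self.body_head(clip_feat),      # (B, num_body)
            "micro_logits": self.micro_head(clip_feat),    # (B, num_micro)
        }


def soft_f1_loss(logits, labels, eps=1e-8):
    """Differentiable (soft) macro-F1 loss over one-hot labels."""
    probs = logits.softmax(dim=-1)
    onehot = F.one_hot(labels, logits.size(-1)).float()
    tp = (probs * onehot).sum(dim=0)
    fp = (probs * (1 - onehot)).sum(dim=0)
    fn = ((1 - probs) * onehot).sum(dim=0)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1 - f1.mean()


def hybrid_loss(out, micro_labels, body_labels, w=(1.0, 0.5, 0.5, 0.5)):
    """Weighted multi-task loss: primary CE + soft-F1 + auxiliary CE terms."""
    ce_micro = F.cross_entropy(out["micro_logits"], micro_labels)
    f1_micro = soft_f1_loss(out["micro_logits"], micro_labels)
    ce_body = F.cross_entropy(out["body_logits"], body_labels)
    # frame-level head supervised with the clip label broadcast to every frame
    B, T, C = out["frame_logits"].shape
    ce_frame = F.cross_entropy(out["frame_logits"].reshape(B * T, C),
                               micro_labels.repeat_interleave(T))
    return w[0] * ce_micro + w[1] * f1_micro + w[2] * ce_body + w[3] * ce_frame
```

In this layout the auxiliary heads only add supervision during training; at inference, predictions would come from the primary micro-action head, optionally fused with optical-flow and ensemble counterparts as described in the abstract.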


Cited By

• (2024) Cross-attention swin-transformer for detailed segmentation of ancient architectural color patterns. Frontiers in Neurorobotics, Vol. 18. https://doi.org/10.3389/fnbot.2024.1513488. Online publication date: 13-Dec-2024.

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 October 2024


    Author Tags

    1. fine-grained action recognition
    2. hybrid loss optimization
    3. multi-auxiliary heads
    4. video micro-action recognition

    Qualifiers

    • Research-article

    Conference

    MM '24
    Sponsor:
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


    Article Metrics

    • Downloads (Last 12 months): 87
    • Downloads (Last 6 weeks): 19
    Reflects downloads up to 11 Jan 2025

