DOI: 10.1145/3664647.3688977
Research article

Micro-Action Recognition via Hierarchical Fusion and Inference

Published: 28 October 2024

Abstract

Micro-actions are spontaneous body movements that reveal a person's true feelings and latent intentions, making micro-action recognition important for human behavior analysis. Recognizing micro-actions is challenging, however, because they are subtle and last only a short time compared with ordinary actions. In this paper, we propose a micro-action recognition framework based on Hierarchical Fusion and Inference (HiFI) to capture subtle multimodal information. Specifically, we first hierarchically integrate multimodal local and global information, including the 2D key-points of the face, hands, and body, depth information, and RGB image sequences. Both 3D-CNNs and Transformers are then used to effectively capture local and long-range dependencies. Finally, we propose a novel from-fine-to-coarse (F2C) inference strategy, based on a hybrid ensemble of multiple branches, to boost the accuracy and credibility of coarse action recognition. Our solution ranked 4th in Track 1 of the MAC Challenge.
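The from-fine-to-coarse idea described above can be illustrated with a small sketch: ensemble the fine-grained predictions of several branches, pick the winning fine-grained class, and derive the coarse category from it. This is a hypothetical illustration, not the paper's implementation; the function name `f2c_inference`, the `FINE_TO_COARSE` mapping, and the uniform branch weights are all assumptions made for the example.

```python
import numpy as np

# Assumed mapping from fine-grained micro-action class index to a
# coarse (e.g. body-part level) category index; purely illustrative.
FINE_TO_COARSE = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1}

def f2c_inference(branch_logits, branch_weights=None):
    """Fuse fine-grained logits from multiple branches, then derive the
    coarse prediction from the winning fine-grained class (fine-to-coarse)."""
    logits = np.stack(branch_logits)            # shape: (n_branches, n_fine)
    if branch_weights is None:
        branch_weights = np.ones(len(branch_logits)) / len(branch_logits)
    # Per-branch softmax, then a weighted average across branches.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    fused = np.average(probs, axis=0, weights=branch_weights)
    fine_pred = int(fused.argmax())
    coarse_pred = FINE_TO_COARSE[fine_pred]
    return fine_pred, coarse_pred

# Two hypothetical branches agreeing on fine-grained class 1.
fine, coarse = f2c_inference([np.array([0.1, 2.0, 0.3, 0.2, 0.1]),
                              np.array([0.2, 1.5, 0.1, 0.4, 0.0])])
```

Predicting the fine-grained class first and only then mapping to the coarse label lets the coarse decision benefit from the finer-grained evidence, which is the intuition behind inferring "from fine to coarse" rather than classifying at the coarse level directly.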



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. 3D CNN
  2. action recognition
  3. human behavior
  4. micro-action recognition
  5. multi-modal fusion

Qualifiers

  • Research-article

Funding Sources

  • the Fundamental Research Funds for the Central Universities
  • the Open project of State Key Laboratory of CAD & CG at Zhejiang University
  • the Proof of Concept Foundation of Xidian University Hangzhou Institute of Technology
  • the National Natural Science Foundation of China
  • the Zhejiang Provincial Science and Technology Program in China

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
