STAN: Spatial-Temporal Awareness Network for Temporal Action Detection

Published: 29 October 2023

Abstract

In recent years, there have been significant advances in temporal action detection, yet few studies have focused on detecting actions in sporting events. In this context, the MMSports 2023 cricket bowl release challenge aims to identify the bowl release action by segmenting untrimmed videos. To this end, we propose a novel cricket bowl release detection framework based on a Spatial-Temporal Awareness Network (STAN), which consists of three modules: a spatial feature extraction module (SFEM), a temporal feature extraction module (TFEM), and a classification module (CM). Specifically, the SFEM first adopts a ResNet to extract spatial features from the video frames. The TFEM then aggregates these features over time with a Bi-LSTM to obtain spatial-temporal features. Finally, the CM converts the spatial-temporal features into per-frame action probabilities, from which the action segments are localized. In addition, we introduce a weighted binary cross-entropy loss to address the data imbalance inherent in cricket bowl release detection. Experiments show that the proposed STAN achieves first place on the cricket bowl release challenge with a PQ score of 0.643. The code is publicly available at https://github.com/lmhr/STAN.
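As a minimal illustration of the pipeline the abstract describes, the sketch below wires the three modules together in PyTorch: a ResNet backbone for per-frame spatial features (SFEM), a Bi-LSTM over the frame sequence (TFEM), and a per-frame classifier (CM) trained with a weighted binary cross-entropy loss. The ResNet depth, hidden size, clip length, and positive-class weight here are illustrative assumptions, not the authors' settings; the paper's actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class STANSketch(nn.Module):
    """Hypothetical sketch of the SFEM -> TFEM -> CM pipeline."""

    def __init__(self, lstm_hidden=256):
        super().__init__()
        # SFEM: ResNet-18 with the classification head removed, giving a
        # 512-d spatial feature per frame (depth is an assumption).
        backbone = models.resnet18(weights=None)
        self.sfem = nn.Sequential(*list(backbone.children())[:-1])
        # TFEM: bidirectional LSTM aggregating frame features along time.
        self.tfem = nn.LSTM(input_size=512, hidden_size=lstm_hidden,
                            batch_first=True, bidirectional=True)
        # CM: per-frame logit for "bowl release" vs. background.
        self.cm = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.sfem(clip.flatten(0, 1))    # (b*t, 512, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)  # (b, t, 512)
        temporal, _ = self.tfem(feats)           # (b, t, 2*lstm_hidden)
        return self.cm(temporal).squeeze(-1)     # (b, t) per-frame logits

# Weighted BCE: up-weight the rare positive (release) frames to counter
# the class imbalance; pos_weight=20.0 is a placeholder value.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(20.0))

model = STANSketch()
clip = torch.randn(2, 16, 3, 112, 112)  # 2 clips of 16 frames each
labels = torch.zeros(2, 16)
labels[:, 7:9] = 1.0                    # a short release span per clip
loss = criterion(model(clip), labels)
```

In practice, the positive-class weight would be tuned to the ratio of background frames to release frames in the training set, since the release occupies only a small fraction of each untrimmed video.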



    Published In

    MMSports '23: Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports
    October 2023
    174 pages
ISBN: 9798400702693
DOI: 10.1145/3606038

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

1. Bi-LSTM
    2. data imbalance
    3. temporal action detection

    Qualifiers

    • Research-article

    Conference

    MM '23

    Acceptance Rates

Overall acceptance rate: 29 of 49 submissions (59%)

