Research article. DOI: 10.1145/3606038.3616169

STAN: Spatial-Temporal Awareness Network for Temporal Action Detection

Published: 29 October 2023

Abstract

In recent years, temporal action detection has advanced significantly. However, few studies have focused on detecting actions in sporting events. In this context, the MMSports 2023 cricket bowl release challenge aims to identify the bowl release action by segmenting untrimmed videos. To this end, we propose a novel cricket bowl release detection framework based on a Spatial-Temporal Awareness Network (STAN), which consists mainly of three modules: the spatial feature extraction module (SFEM), the temporal feature extraction module (TFEM), and the classification module (CM). Specifically, the SFEM first adopts ResNet to extract spatial features from the videos. The TFEM then aggregates these features over time using a Bi-LSTM to obtain spatial-temporal features. Afterward, the CM converts the spatial-temporal features into action category probabilities to localize the action segments. In addition, we introduce a weighted binary cross entropy loss to address the data imbalance problem in cricket bowl release detection. Finally, experiments show that our proposed STAN achieves competitive performance, ranking 1st with a PQ score of 0.643 on the cricket bowl release challenge. The code is publicly available at https://github.com/lmhr/STAN.
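
As a rough illustration of the last two steps described in the abstract, the sketch below shows one common form of a weighted binary cross entropy loss and a simple way to turn per-frame action probabilities into segments. It is not taken from the STAN repository. The function names `weighted_bce` and `probs_to_segments`, the `pos_weight` parameter, the 0.5 threshold, and the toy inputs are all assumptions made for illustration; the abstract does not state the paper's actual weighting scheme or post-processing.

```python
import math
from typing import List, Sequence, Tuple


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 pos_weight: float = 1.0, eps: float = 1e-7) -> float:
    """Mean binary cross entropy with positive frames scaled by pos_weight.

    pos_weight > 1 makes errors on the rare positive (action) frames
    cost more than errors on the frequent background frames.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one frame is required")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def probs_to_segments(probs: Sequence[float],
                      threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames with probability >= threshold.

    Returns (start, end) frame indices, with end inclusive.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a new segment opens at this frame
        elif p < threshold and start is not None:
            segments.append((start, i - 1))  # the segment closed one frame ago
            start = None
    if start is not None:
        segments.append((start, len(probs) - 1))  # segment runs to the last frame
    return segments


# Hypothetical per-frame probabilities and ground-truth labels.
frame_probs = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]
frame_labels = [0, 0, 1, 1, 0, 1]

loss_plain = weighted_bce(frame_probs, frame_labels)
loss_weighted = weighted_bce(frame_probs, frame_labels, pos_weight=3.0)
segments = probs_to_segments(frame_probs)
```

With `pos_weight` above 1, mistakes on the scarce action frames contribute more to the loss, which is the usual motivation for weighting under class imbalance. A fixed threshold is only the simplest way to obtain segments; a real pipeline might smooth the probabilities or merge short gaps first.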

Cited By

  • (2024) Quantifying NBA Shot Quality: A Deep Network Approach. Proceedings of the 7th ACM International Workshop on Multimedia Content Analysis in Sports, pp. 91-95. DOI: 10.1145/3689061.3689068. Online publication date: 28-Oct-2024

    Published In

    MMSports '23: Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports
    October 2023
    174 pages
    ISBN: 9798400702693
    DOI: 10.1145/3606038

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. bi-lstm
    2. data imbalance
    3. temporal action detection

    Conference

    MM '23

    Acceptance Rates

    Overall Acceptance Rate: 29 of 49 submissions (59%)
