Abstract
Online temporal action localization (On-TAL) is the task of identifying multiple action instances in a streaming video. Since existing methods take only a fixed-size video segment as input at each iteration, they struggle to capture long-term context and require careful tuning of the segment size. To overcome these limitations, we propose the memory-augmented transformer (MATR). MATR maintains a memory queue that selectively preserves past segment features, allowing the model to leverage long-term context for inference. We also propose a novel action localization method that observes the current input segment to predict the end time of the ongoing action and accesses the memory queue to estimate its start time. Our method outperforms existing methods on two datasets, THUMOS14 and MUSES, surpassing not only TAL methods in the online setting but also some offline TAL methods.
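To make the mechanism concrete, below is a minimal Python sketch of the memory-queue idea, not the authors' implementation: the class name MemoryQueue, the cosine-similarity selection rule, the queue capacity, and the threshold are all assumptions for illustration, and where the paper accesses memory with transformer attention, the sketch substitutes a simple nearest-neighbor lookup.

```python
# Illustrative sketch of a memory queue that selectively preserves past
# segment features (assumed design; thresholds and rules are hypothetical).
from collections import deque

import numpy as np


class MemoryQueue:
    """Fixed-capacity FIFO that selectively keeps past segment features."""

    def __init__(self, capacity: int = 64, sim_threshold: float = 0.9):
        self.features = deque(maxlen=capacity)    # oldest entries evicted first
        self.timestamps = deque(maxlen=capacity)
        self.sim_threshold = sim_threshold

    def push(self, feature: np.ndarray, timestamp: float) -> None:
        """Store a segment feature only if it is sufficiently novel."""
        if self.features:
            last = self.features[-1]
            cos = float(feature @ last /
                        (np.linalg.norm(feature) * np.linalg.norm(last) + 1e-8))
            if cos > self.sim_threshold:  # near-duplicate of last entry: skip
                return
        self.features.append(feature)
        self.timestamps.append(timestamp)

    def closest_start(self, query: np.ndarray) -> float:
        """Timestamp of the stored feature most similar to the query; a toy
        stand-in for attending over memory to estimate the action's start."""
        sims = [float(query @ f) for f in self.features]
        return self.timestamps[int(np.argmax(sims))]


# Toy usage: stream segment features, then localize an ongoing action.
rng = np.random.default_rng(0)
memory = MemoryQueue(capacity=8)
for t in range(32):
    memory.push(rng.normal(size=128), timestamp=float(t))

current_segment = rng.normal(size=128)
end_time = 32.0  # in MATR, the end is predicted from the current segment
start_time = memory.closest_start(current_segment)  # start read from memory
print(f"action interval: [{start_time}, {end_time}]")
```

The sketch mirrors the split of responsibilities described in the abstract: the current segment alone determines the end time, while the start time is recovered from long-term memory.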
Y. Song and D. Kim—Equal contribution.
Acknowledgement
This work was supported by NRF and IITP grants funded by the Ministry of Science and ICT, Korea (RS-2019-II191906, RS-2021-II212068, RS-2022-II220290, NRF-2021R1A2C3012728).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Song, Y., Kim, D., Cho, M., Kwak, S. (2025). Online Temporal Action Localization with Memory-Augmented Transformer. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15077. Springer, Cham. https://doi.org/10.1007/978-3-031-72655-2_5
DOI: https://doi.org/10.1007/978-3-031-72655-2_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72654-5
Online ISBN: 978-3-031-72655-2
eBook Packages: Computer Science, Computer Science (R0)