DOI: 10.1145/3581783.3611872

RefineTAD: Learning Proposal-free Refinement for Temporal Action Detection

Published: 27 October 2023

Abstract

Temporal action detection (TAD) aims to localize the start and end frames of actions in untrimmed videos, a challenging task due to the similarity of adjacent frames and the ambiguity of action boundaries. Previous methods often generate coarse proposals first and then perform proposal-based refinement, which couples the refinement with prior action detectors and yields proposal-oriented offsets. This paradigm increases the training difficulty of the TAD model and is heavily influenced by the quantity and quality of the proposals. To address these issues, we decouple the refinement process from conventional TAD methods and propose a learnable, proposal-free refinement method for fine boundary localization, named RefineTAD. We first propose a multi-level refinement module that generates multi-scale boundary offsets, score offsets, and boundary-aware probabilities at each time point based on the feature pyramid. We then propose an offset focusing strategy that progressively refines the predictions of TAD models in a coarse-to-fine manner using our multi-scale offsets. Extensive experiments on three challenging datasets demonstrate that RefineTAD significantly improves state-of-the-art TAD methods with minimal computational overhead.
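The coarse-to-fine offset focusing idea from the abstract can be illustrated with a minimal sketch. All names here are hypothetical: the paper's actual module predicts offsets at every time point of a feature pyramid, whereas this toy simply applies one (start, end) offset pair per pyramid level, ordered from coarsest to finest.

```python
def refine_boundaries(start, end, offsets_per_level):
    """Coarse-to-fine refinement of a single (start, end) prediction.

    offsets_per_level: list of (d_start, d_end) pairs, ordered from the
    coarsest level to the finest. Hypothetical interface, for illustration
    of the progressive-refinement idea only.
    """
    for d_start, d_end in offsets_per_level:
        start += d_start  # shift the start boundary at this scale
        end += d_end      # shift the end boundary at this scale
    return start, end

# A coarse prediction of [10.0, 20.0] seconds refined over three levels,
# with the offset magnitudes shrinking as the levels get finer:
s, e = refine_boundaries(10.0, 20.0, [(0.8, -1.2), (0.3, 0.4), (-0.1, 0.05)])
```

Because the refinement consumes only a predicted interval plus the learned offsets, it is independent of how the interval was produced, which is the sense in which the method is proposal-free and detector-agnostic.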


Cited By

  • Action Progression Networks for Temporal Action Detection in Videos. IEEE Access, vol. 12, pp. 126829-126844, 2024. DOI: 10.1109/ACCESS.2024.3451503

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. boundary refinement
    2. feature pyramid
    3. temporal action detection
    4. video understanding

    Qualifiers

    • Research-article


    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
