research-article

Proposal Complementary Action Detection

Authors:

Qingming Huang

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 16, Issue 2s

Article No.: 64, Pages 1 - 12

https://doi.org/10.1145/3361845

Published: 21 June 2020

Abstract

Temporal action detection requires not only correct classification but also accurate detection of the start and end times of each action. However, traditional approaches typically rely on sliding windows or actionness scores to predict actions, and such models are difficult to train end-to-end. In this article, we take a different approach that detects actions end-to-end: a single network computes the action probabilities directly as part of its output. We present PCAD, a novel proposal complementary action detector for continuous, untrimmed video streams. Our approach first uses a simple fully 3D convolutional network to encode the video stream and then generates candidate temporal proposals for activities using anchor segments. To generate more precise proposals, we also design a boundary proposal network that supplies complementary information for the candidate proposals. Finally, we learn an efficient classifier that assigns the generated proposals to activity classes and refines their temporal boundaries at the same time. The model is trained end-to-end by jointly optimizing a classification loss and a regression loss. On the THUMOS’14 detection benchmark, PCAD achieves state-of-the-art performance among high-speed models.
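The abstract names three ingredients without giving formulas: anchor segments, boundary refinement, and a joint classification and regression loss. The following is a minimal sketch of how such pieces are commonly put together, not the PCAD implementation. Everything in it is an assumption made for illustration: the function names, the anchor scales, the centre/length offset parameterization, the choice of cross-entropy and smooth L1, and the `reg_weight` parameter.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in feature-map time steps


def make_anchor_segments(num_steps: int, scales: Sequence[int]) -> List[Segment]:
    """Place one anchor of every scale at the centre of every time step."""
    anchors = []
    for t in range(num_steps):
        center = t + 0.5
        for scale in scales:
            anchors.append((center - scale / 2.0, center + scale / 2.0))
    return anchors


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1D segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def regression_targets(anchor: Segment, gt: Segment) -> Tuple[float, float]:
    """Assumed parameterization: centre shift relative to the anchor length,
    and log ratio of ground-truth length to anchor length."""
    a_len = anchor[1] - anchor[0]
    g_len = gt[1] - gt[0]
    a_center = (anchor[0] + anchor[1]) / 2.0
    g_center = (gt[0] + gt[1]) / 2.0
    return ((g_center - a_center) / a_len, math.log(g_len / a_len))


def smooth_l1(x: float) -> float:
    """Smooth L1 penalty: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def joint_loss(class_probs: Sequence[float], label: int,
               pred_offsets: Sequence[float], target_offsets: Sequence[float],
               reg_weight: float = 1.0) -> float:
    """Cross-entropy on the class plus weighted smooth L1 on the offsets
    (an assumed combination; the abstract does not specify the losses)."""
    cls = -math.log(max(class_probs[label], 1e-12))
    reg = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return cls + reg_weight * reg


# Usage: match a hypothetical ground-truth action to its best anchor.
anchors = make_anchor_segments(num_steps=4, scales=[2, 4])
gt = (1.0, 5.0)
best = max(anchors, key=lambda a: temporal_iou(a, gt))
targets = regression_targets(best, gt)
```

Because both loss terms are differentiable functions of one network's outputs, a single backward pass can update the encoder, the proposal stage, and the classifier together, which is the sense in which a detector of this kind is trained end-to-end.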


Cited By

Xing K, Li T, Wang X (2022). ProposalVLAD with Proposal-Intra Exploring for Temporal Action Proposal Generation. ACM Transactions on Multimedia Computing, Communications, and Applications 19:3 (1-18). Online publication date: 24-Nov-2022. https://dl.acm.org/doi/10.1145/3571747
Xie C, Zhuang Z, Zhao S, Liang S (2022). Temporal Dropout for Weakly Supervised Action Localization. ACM Transactions on Multimedia Computing, Communications, and Applications 19:3 (1-24). Online publication date: 7-Nov-2022. https://dl.acm.org/doi/10.1145/3567827
Chen W, Li G, Zhang X, Wang S, Li L, Huang Q (2022). Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19:1 (1-22). Online publication date: 18-Jul-2022. https://dl.acm.org/doi/10.1145/3514250

Index Terms

Proposal Complementary Action Detection
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks


Information & Contributors

Information

Published In


ACM Transactions on Multimedia Computing, Communications, and Applications Volume 16, Issue 2s

Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers

April 2020

291 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3407689

Editor:
Alberto Del Bimbo
University of Firenze, Italy


Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Accepted: 01 September 2020

Published: 21 June 2020

Online AM: 07 May 2020

Revised: 01 August 2019

Received: 01 June 2019

Published in TOMM Volume 16, Issue 2s


Qualifiers

Research-article
Research
Refereed

Funding Sources

Innovative Research Group Project of the National Natural Science Foundation of China


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 106
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 2

Citations

Cited By

Qin X, Zhao H, Lin G, Zeng H, Xu S, Li X (2022). PcmNet: Position-sensitive context modeling network for temporal action localization. Neurocomputing 510 (48-58). Online publication date: Oct-2022. https://doi.org/10.1016/j.neucom.2022.08.040
