research-article

Proposal Complementary Action Detection

Published: 21 June 2020

    Abstract

    Temporal action detection requires not only correct classification but also accurate detection of the start and end times of each action. However, traditional approaches typically rely on sliding windows or actionness scores to predict actions, and it is difficult to train a model end-to-end with either. In this article, we take a different approach that detects actions end-to-end: a single network directly outputs the action probabilities as part of its results. We present PCAD, a novel proposal complementary action detector for continuous, untrimmed video streams. Our approach first uses a simple fully 3D convolutional network to encode the video stream and then generates candidate temporal proposals for activities by using anchor segments. To generate more precise proposals, we also design a boundary proposal network that supplies complementary information for the candidate proposals. Finally, we learn an efficient classifier that assigns the generated proposals to activity classes and refines their temporal boundaries at the same time. The model is trained end-to-end by jointly optimizing a classification loss and a regression loss. When evaluated on the THUMOS’14 detection benchmark, PCAD achieves state-of-the-art performance among high-speed models.
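
The anchor-segment step mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names, the anchor scales, and the centre/log-length regression targets (a parameterization borrowed from region-proposal detectors in general, not confirmed as PCAD's own).

```python
import math
from typing import List, Tuple

# A temporal segment as (start, end), measured in feature-map time steps.
Segment = Tuple[float, float]


def make_anchor_segments(num_steps: int, scales: List[float]) -> List[Segment]:
    """Centre one anchor segment per scale on every temporal position.

    `scales` are hypothetical anchor lengths; the abstract does not list them.
    """
    anchors = []
    for t in range(num_steps):
        centre = t + 0.5
        for scale in scales:
            anchors.append((centre - scale / 2.0, centre + scale / 2.0))
    return anchors


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two 1D segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def regression_targets(anchor: Segment, gt: Segment) -> Tuple[float, float]:
    """Assumed boundary-refinement targets for one matched anchor:
    centre offset normalised by anchor length, and log length ratio."""
    a_len = anchor[1] - anchor[0]
    g_len = gt[1] - gt[0]
    a_centre = (anchor[0] + anchor[1]) / 2.0
    g_centre = (gt[0] + gt[1]) / 2.0
    return ((g_centre - a_centre) / a_len, math.log(g_len / a_len))


if __name__ == "__main__":
    anchors = make_anchor_segments(num_steps=8, scales=[2.0, 4.0])
    ground_truth = (2.0, 6.0)
    # Pick the anchor that overlaps the ground-truth action the most.
    best = max(anchors, key=lambda seg: temporal_iou(seg, ground_truth))
    print(best, temporal_iou(best, ground_truth), regression_targets(best, ground_truth))
```

In a detector of this kind, anchors whose temporal IoU with a ground-truth action exceeds a chosen threshold would be treated as positives for the classification loss, and their regression targets would feed the boundary-refinement loss; the threshold and loss weighting are likewise not stated in the abstract.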



    Information

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2s
    Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers
    April 2020
    291 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3407689
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Accepted: 01 September 2020
    Published: 21 June 2020
    Online AM: 07 May 2020
    Revised: 01 August 2019
    Received: 01 June 2019
    Published in TOMM Volume 16, Issue 2s

    Author Tags

    1. 3D convolutional network
    2. Temporal action detection
    3. boundary proposal network

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Innovative Research Group Project of the National Natural Science Foundation of China

    Cited By

    • (2022) ProposalVLAD with Proposal-Intra Exploring for Temporal Action Proposal Generation. ACM Transactions on Multimedia Computing, Communications, and Applications 19:3 (1-18). DOI: 10.1145/3571747. Online publication date: 24-Nov-2022
    • (2022) Temporal Dropout for Weakly Supervised Action Localization. ACM Transactions on Multimedia Computing, Communications, and Applications 19:3 (1-24). DOI: 10.1145/3567827. Online publication date: 7-Nov-2022
    • (2022) Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19:1 (1-22). DOI: 10.1145/3514250. Online publication date: 18-Jul-2022
    • (2022) PcmNet: Position-sensitive context modeling network for temporal action localization. Neurocomputing 510 (48-58). DOI: 10.1016/j.neucom.2022.08.040. Online publication date: Oct-2022
