Spatio-Temporal Deep Residual Network with Hierarchical Attentions for Video Event Recognition

Published: 21 June 2020

Abstract

    Event recognition in surveillance video has attracted extensive attention from the computer vision community. The task remains challenging because of subtle inter-class variations caused by factors such as severe occlusion and cluttered backgrounds. To address these issues, we propose a spatio-temporal deep residual network with hierarchical attentions (STDRN-HA) for video event recognition. In the first attention layer, the ResNet fully connected feature guides the Faster R-CNN feature to generate object-based attention (O-attention) for target objects. In the second attention layer, the O-attention further guides the ResNet convolutional feature to yield holistic attention (H-attention), which perceives more details of occluded objects and the global background. In the third attention layer, the attention maps are combined with the deep features to obtain attention-enhanced features. These attention-enhanced features are then fed into a deep residual recurrent network that mines further event clues from the videos. Furthermore, an optimized loss function named softmax-RC is designed, which embeds residual-block regularization and center loss to mitigate vanishing gradients in the deep network and enlarge inter-class distances. We also build a temporal branch to exploit long- and short-term motion information; the final results are obtained by fusing the outputs of the spatial and temporal streams. Experiments on four realistic video datasets (CCV, VIRAT 1.0, VIRAT 2.0, and HMDB51) demonstrate that the proposed method performs well and achieves state-of-the-art results.
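    The abstract says the third attention layer uses attention maps together with deep features to obtain attention-enhanced features, but does not spell out the operation. A minimal NumPy sketch of one plausible reading (a softmax-normalized spatial attention map weighting a convolutional feature map before pooling; the function name and shapes are illustrative assumptions, not the paper's exact formulation) is:

    ```python
    import numpy as np

    def attention_enhance(conv_feat, attention_map):
        """Pool a convolutional feature map into a vector, weighting each
        spatial position by a softmax-normalized attention score.

        conv_feat:     (C, H, W) deep convolutional features
        attention_map: (H, W) unnormalized spatial attention scores
        """
        scores = attention_map.reshape(-1)
        scores = np.exp(scores - scores.max())        # numerically stable softmax
        weights = (scores / scores.sum()).reshape(attention_map.shape)
        # Attention-weighted sum over all spatial positions -> (C,) vector.
        return (conv_feat * weights[None, :, :]).sum(axis=(1, 2))

    feat = np.random.rand(512, 7, 7)   # e.g. a ResNet conv5 feature map
    attn = np.random.rand(7, 7)        # e.g. an H-attention map
    print(attention_enhance(feat, attn).shape)  # (512,)
    ```

    With a uniform attention map this reduces to ordinary global average pooling; the attention map biases the pooled vector toward the attended regions.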
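    The softmax-RC loss is described only at a high level: softmax classification augmented with residual-block regularization and center loss. As a hedged illustration, the following NumPy sketch covers only the softmax cross-entropy plus center-loss portion (the residual-block regularization term is omitted, and the function name and `lam` weight are assumptions rather than the paper's notation):

    ```python
    import numpy as np

    def softmax_plus_center_loss(features, logits, labels, centers, lam=0.5):
        """Cross-entropy over softmax scores plus a center-loss term that
        pulls each feature toward its class center, encouraging compact
        intra-class clusters and larger inter-class distances.

        features: (N, D) penultimate-layer features
        logits:   (N, K) class scores
        labels:   (N,)   integer class labels
        centers:  (K, D) learned class centers
        lam:      weight of the center-loss term
        """
        z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        ce = -log_probs[np.arange(len(labels)), labels].mean()
        diff = features - centers[labels]                # (N, D)
        center = 0.5 * (diff ** 2).sum(axis=1).mean()
        return ce + lam * center
    ```

    When each feature sits exactly at its class center the center term vanishes, and uniform logits give a loss of log K, which makes the two terms easy to sanity-check in isolation.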



      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2s
      Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers
      April 2020, 291 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3407689

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 June 2020
      Online AM: 07 May 2020
      Accepted: 01 January 2020
      Revised: 01 November 2019
      Received: 01 May 2019
      Published in TOMM Volume 16, Issue 2s


      Author Tags

      1. Event recognition
      2. deep residual recurrent network
      3. hierarchical attention
      4. spatio-temporal
      5. surveillance video

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • Provincial Natural Science Foundation of Zhejiang
      • Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
      • Natural Science Foundation of the Jiangsu Higher Education Institutions of China
      • National Natural Science Foundation of China
      • Natural Science Foundation of Jiangsu Province


      Cited By

      • (2023) Full-body Human Motion Reconstruction with Sparse Joint Tracking Using Flexible Sensors. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 2 (2023), 1-19. DOI: 10.1145/3564700
      • (2023) An Overview of Video Tampering Detection Techniques: State-of-the-Art and Future Directions. In 2023 International Conference on Computational Intelligence and Sustainable Engineering Solutions (CISES), 171-175. DOI: 10.1109/CISES58720.2023.10183511
      • (2023) Algorithm Used in Video Event Recognition & Classification with Hierarchical Modeling. In 2023 IEEE World Conference on Applied Intelligence and Computing (AIC), 608-613. DOI: 10.1109/AIC57670.2023.10263963
      • (2022) Action Recognition Using Action Sequences Optimization and Two-Stream 3D Dilated Neural Network. Computational Intelligence and Neuroscience 2022 (2022). DOI: 10.1155/2022/6608448
      • (2022) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6 (2022), 1-22. DOI: 10.1145/3568312
      • (2022) Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2s (2022), 1-15. DOI: 10.1145/3538749
      • (2022) An Effective Forest Fire Detection Framework Using Heterogeneous Wireless Multimedia Sensor Networks. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2 (2022), 1-21. DOI: 10.1145/3473037
      • (2022) STHARNet: spatio-temporal human action recognition network in content based video retrieval. Multimedia Tools and Applications 82, 24 (2022), 38051-38066. DOI: 10.1007/s11042-022-14056-8
      • (2021) A Continuous Semantic Embedding Method for Video Compact Representation. Electronics 10, 24 (2021), 3106. DOI: 10.3390/electronics10243106
      • (2021) Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems. In Emergence of Cyber Physical System and IoT in Smart Automation and Robotics, 51-68. DOI: 10.1007/978-3-030-66222-6_4
