Spatio-Temporal Deep Residual Network with Hierarchical Attentions for Video Event Recognition

Published: 21 June 2020

Abstract

    Event recognition in surveillance video has attracted extensive attention from the computer vision community. The task remains challenging because of subtle inter-class variations caused by factors such as severe occlusion and cluttered backgrounds. To address these issues, we propose a spatio-temporal deep residual network with hierarchical attentions (STDRN-HA) for video event recognition. In the first attention layer, the ResNet fully connected feature guides the Faster R-CNN feature to generate object-based attention (O-attention) for target objects. In the second attention layer, the O-attention further guides the ResNet convolutional feature to yield holistic attention (H-attention), which perceives more details of occluded objects and the global background. In the third attention layer, the attention maps are combined with the deep features to obtain attention-enhanced features. These attention-enhanced features are then fed into a deep residual recurrent network that mines further event clues from the videos. Furthermore, an optimized loss function named softmax-RC is designed, which embeds residual-block regularization and center loss to mitigate vanishing gradients in the deep network and enlarge inter-class distances. We also build a temporal branch to exploit long- and short-term motion information; the final results are obtained by fusing the outputs of the spatial and temporal streams. Experiments on four realistic video datasets (CCV, VIRAT 1.0, VIRAT 2.0, and HMDB51) demonstrate that the proposed method performs well and achieves state-of-the-art results.
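    The abstract says the third attention layer uses attention maps together with deep features to obtain attention-enhanced features, but does not spell out the operation. A minimal NumPy sketch of one plausible reading (a softmax-normalized spatial attention map weighting a convolutional feature map before pooling; the function name and shapes are illustrative assumptions, not the paper's exact formulation) is:

    ```python
    import numpy as np

    def attention_enhance(conv_feat, attention_map):
        """Pool a convolutional feature map into a vector, weighting each
        spatial position by a softmax-normalized attention score.

        conv_feat:     (C, H, W) deep convolutional features
        attention_map: (H, W) unnormalized spatial attention scores
        """
        scores = attention_map.reshape(-1)
        scores = np.exp(scores - scores.max())        # numerically stable softmax
        weights = (scores / scores.sum()).reshape(attention_map.shape)
        # Attention-weighted sum over all spatial positions -> (C,) vector.
        return (conv_feat * weights[None, :, :]).sum(axis=(1, 2))

    feat = np.random.rand(512, 7, 7)   # e.g. a ResNet conv5 feature map
    attn = np.random.rand(7, 7)        # e.g. an H-attention map
    print(attention_enhance(feat, attn).shape)  # (512,)
    ```

    With a uniform attention map this reduces to ordinary global average pooling; the attention map biases the pooled vector toward the attended regions.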
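    The softmax-RC loss is described only at a high level: softmax classification augmented with residual-block regularization and center loss. As a hedged illustration, the following NumPy sketch covers only the softmax cross-entropy plus center-loss portion (the residual-block regularization term is omitted, and the function name and `lam` weight are assumptions rather than the paper's notation):

    ```python
    import numpy as np

    def softmax_plus_center_loss(features, logits, labels, centers, lam=0.5):
        """Cross-entropy over softmax scores plus a center-loss term that
        pulls each feature toward its class center, encouraging compact
        intra-class clusters and larger inter-class distances.

        features: (N, D) penultimate-layer features
        logits:   (N, K) class scores
        labels:   (N,)   integer class labels
        centers:  (K, D) learned class centers
        lam:      weight of the center-loss term
        """
        z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        ce = -log_probs[np.arange(len(labels)), labels].mean()
        diff = features - centers[labels]                # (N, D)
        center = 0.5 * (diff ** 2).sum(axis=1).mean()
        return ce + lam * center
    ```

    When each feature sits exactly at its class center the center term vanishes, and uniform logits give a loss of log K, which makes the two terms easy to sanity-check in isolation.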



      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2s
      Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers
      April 2020, 291 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3407689

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 21 June 2020
      Online AM: 07 May 2020
      Accepted: 01 January 2020
      Revised: 01 November 2019
      Received: 01 May 2019
      Published in TOMM Volume 16, Issue 2s


      Author Tags

      1. Event recognition
      2. deep residual recurrent network
      3. hierarchical attention
      4. spatio-temporal
      5. surveillance video

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • Provincial Natural Science Foundation of Zhejiang
      • Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
      • Natural Science Foundation of the Jiangsu Higher Education Institutions of China
      • National Natural Science Foundation of China
      • Natural Science Foundation of Jiangsu Province


      Cited By

      • (2023) Full-body Human Motion Reconstruction with Sparse Joint Tracking Using Flexible Sensors. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 2 (2023), 1-19. DOI: 10.1145/3564700
      • (2023) An Overview of Video Tampering Detection Techniques: State-of-the-Art and Future Directions. In 2023 International Conference on Computational Intelligence and Sustainable Engineering Solutions (CISES), 171-175. DOI: 10.1109/CISES58720.2023.10183511
      • (2023) Algorithm Used in Video Event Recognition & Classification with Hierarchical Modeling. In 2023 IEEE World Conference on Applied Intelligence and Computing (AIC), 608-613. DOI: 10.1109/AIC57670.2023.10263963
      • (2022) Action Recognition Using Action Sequences Optimization and Two-Stream 3D Dilated Neural Network. Computational Intelligence and Neuroscience 2022 (2022). DOI: 10.1155/2022/6608448
      • (2022) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 6 (2022), 1-22. DOI: 10.1145/3568312
      • (2022) Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2s (2022), 1-15. DOI: 10.1145/3538749
      • (2022) An Effective Forest Fire Detection Framework Using Heterogeneous Wireless Multimedia Sensor Networks. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2 (2022), 1-21. DOI: 10.1145/3473037
      • (2022) STHARNet: spatio-temporal human action recognition network in content based video retrieval. Multimedia Tools and Applications 82, 24 (2022), 38051-38066. DOI: 10.1007/s11042-022-14056-8
      • (2021) A Continuous Semantic Embedding Method for Video Compact Representation. Electronics 10, 24 (2021), 3106. DOI: 10.3390/electronics10243106
      • (2021) Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems. In Emergence of Cyber Physical System and IoT in Smart Automation and Robotics, 51-68. DOI: 10.1007/978-3-030-66222-6_4
