DOI: 10.1145/3343031.3350978 (MM Conference Proceedings)
Research article

Long Short-Term Relation Networks for Video Action Detection

Published: 15 October 2019

Abstract

It has been well recognized that modeling human-object or object-object relations is helpful for detection tasks. Nevertheless, the problem is non-trivial, especially when exploring the interactions among human actor, object, and scene (collectively, human-context) to boost video action detectors. The difficulty originates from the fact that reliable relations in a video should depend not only on the short-term human-context relations in the present clip but also on the temporal dynamics distilled over a long-range span of the video. This motivates us to capture both short-term and long-term relations in a video. In this paper, we present Long Short-Term Relation Networks, dubbed LSTR, which aggregate and propagate relations to augment features for video action detection. Technically, a Region Proposal Network (RPN) is first remoulded to produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then models short-term human-context interactions within each clip through a spatio-temporal attention mechanism, and reasons about long-term temporal dynamics across video clips via Graph Convolutional Networks (GCN), in a cascaded manner. Extensive experiments are conducted on four benchmark datasets, and superior results are reported compared to state-of-the-art methods.
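The two-stage cascade in the abstract — per-clip attention-based aggregation of context features around an actor tubelet, followed by graph-convolutional reasoning over clip-level features — can be sketched in miniature. This is a simplified illustration under assumed shapes, not the paper's implementation; all dimensions, the fully connected clip graph, and the function names are hypothetical, and the GCN step follows the standard normalized propagation rule of Kipf and Welling rather than LSTR's exact design.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
D = 8          # feature dimension
N_CTX = 5      # context regions (objects / scene) per clip
N_CLIPS = 4    # clip-level nodes for long-range reasoning

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def short_term_attention(actor, context):
    """Augment an actor (tubelet) feature with attention-weighted
    context features -- a simplified stand-in for short-term
    human-context modeling within one clip."""
    scores = context @ actor / np.sqrt(len(actor))   # (N_CTX,)
    weights = softmax(scores)
    return actor + weights @ context                 # residual aggregation

def gcn_layer(X, A, W):
    """One graph-convolution step over clip-level nodes:
    H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

# Cascade: short-term aggregation per clip, then long-term GCN reasoning.
actors = rng.normal(size=(N_CLIPS, D))            # one actor tubelet per clip
contexts = rng.normal(size=(N_CLIPS, N_CTX, D))   # context regions per clip
clip_feats = np.stack([short_term_attention(a, c)
                       for a, c in zip(actors, contexts)])

A = np.ones((N_CLIPS, N_CLIPS)) - np.eye(N_CLIPS)  # fully connected clip graph
W = rng.normal(size=(D, D)) * 0.1
long_term = gcn_layer(clip_feats, A, W)
print(long_term.shape)   # (4, 8): one relation-augmented feature per clip
```

The cascade matters: the GCN consumes features already enriched by intra-clip attention, so long-range reasoning operates on relation-aware rather than raw tubelet features.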




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. action detection
  2. action recognition
  3. attention
  4. relation

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)


Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 1
Reflects downloads up to 31 Jan 2025


Cited By

  • (2024) A survey on deep learning-based spatio-temporal action detection. International Journal of Wavelets, Multiresolution and Information Processing 22(4). DOI: 10.1142/S0219691323500662
  • (2024) Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization. IEEE Transactions on Multimedia 26, 10279-10289. DOI: 10.1109/TMM.2024.3405710
  • (2024) Learning Temporal Dynamics in Videos With Image Transformer. IEEE Transactions on Multimedia 26, 8915-8927. DOI: 10.1109/TMM.2024.3383662
  • (2024) Free-Form Composition Networks for Egocentric Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology 34(10), 9967-9978. DOI: 10.1109/TCSVT.2024.3406546
  • (2024) Video Visual Relation Detection Based on Trajectory Fusion. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. DOI: 10.1109/IJCNN60899.2024.10650663
  • (2024) Online spatio-temporal action detection with adaptive sampling and hierarchical modulation. Multimedia Systems 30(6). DOI: 10.1007/s00530-024-01543-1
  • (2022) Attention Guided Relation Detection Approach for Video Visual Relation Detection. IEEE Transactions on Multimedia 24, 3896-3907. DOI: 10.1109/TMM.2021.3109430
  • (2022) Spatial-Temporal Action Localization With Hierarchical Self-Attention. IEEE Transactions on Multimedia 24, 625-639. DOI: 10.1109/TMM.2021.3056892
  • (2022) Stand-Alone Inter-Frame Attention in Video Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3182-3191. DOI: 10.1109/CVPR52688.2022.00319
  • (2021) Identity-aware Graph Memory Network for Action Detection. Proceedings of the 29th ACM International Conference on Multimedia, 3437-3445. DOI: 10.1145/3474085.3475503
