DOI: 10.1145/3343031.3350978 (MM Conference Proceedings)
Research article

Long Short-Term Relation Networks for Video Action Detection

Published: 15 October 2019

Abstract

It has been well recognized that modeling human-object or object-object relations is helpful for detection tasks. Nevertheless, the problem is non-trivial, especially when exploring the interactions among human actor, object, and scene (collectively, human-context) to boost video action detectors. The difficulty originates from the fact that reliable relations in a video should depend not only on the short-term human-context relations in the present clip but also on the temporal dynamics distilled over a long-range span of the video. This motivates us to capture both short-term and long-term relations in a video. In this paper, we present Long Short-Term Relation Networks, dubbed LSTR, which aggregate and propagate relations to augment features for video action detection. Technically, a Region Proposal Network (RPN) is first remoulded to produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then models short-term human-context interactions within each clip through a spatio-temporal attention mechanism, and reasons about long-term temporal dynamics across video clips via Graph Convolutional Networks (GCN), in a cascaded manner. Extensive experiments are conducted on four benchmark datasets, and superior results are reported compared to state-of-the-art methods.
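The two-stage cascade in the abstract — per-clip attention-based aggregation of context features around an actor tubelet, followed by graph-convolutional reasoning over clip-level features — can be sketched in miniature. This is a simplified illustration under assumed shapes, not the paper's implementation; all dimensions, the fully connected clip graph, and the function names are hypothetical, and the GCN step follows the standard normalized propagation rule of Kipf and Welling rather than LSTR's exact design.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
D = 8          # feature dimension
N_CTX = 5      # context regions (objects / scene) per clip
N_CLIPS = 4    # clip-level nodes for long-range reasoning

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def short_term_attention(actor, context):
    """Augment an actor (tubelet) feature with attention-weighted
    context features -- a simplified stand-in for short-term
    human-context modeling within one clip."""
    scores = context @ actor / np.sqrt(len(actor))   # (N_CTX,)
    weights = softmax(scores)
    return actor + weights @ context                 # residual aggregation

def gcn_layer(X, A, W):
    """One graph-convolution step over clip-level nodes:
    H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

# Cascade: short-term aggregation per clip, then long-term GCN reasoning.
actors = rng.normal(size=(N_CLIPS, D))            # one actor tubelet per clip
contexts = rng.normal(size=(N_CLIPS, N_CTX, D))   # context regions per clip
clip_feats = np.stack([short_term_attention(a, c)
                       for a, c in zip(actors, contexts)])

A = np.ones((N_CLIPS, N_CLIPS)) - np.eye(N_CLIPS)  # fully connected clip graph
W = rng.normal(size=(D, D)) * 0.1
long_term = gcn_layer(clip_feats, A, W)
print(long_term.shape)   # (4, 8): one relation-augmented feature per clip
```

The cascade matters: the GCN consumes features already enriched by intra-clip attention, so long-range reasoning operates on relation-aware rather than raw tubelet features.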




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. action detection
  2. action recognition
  3. attention
  4. relation

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)


Article Metrics

  • Downloads (last 12 months): 16
  • Downloads (last 6 weeks): 1
Reflects downloads up to 31 Jan 2025


Cited By

  • (2024) A survey on deep learning-based spatio-temporal action detection. International Journal of Wavelets, Multiresolution and Information Processing 22(4). DOI: 10.1142/S0219691323500662
  • (2024) Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization. IEEE Transactions on Multimedia 26, 10279-10289. DOI: 10.1109/TMM.2024.3405710
  • (2024) Learning Temporal Dynamics in Videos With Image Transformer. IEEE Transactions on Multimedia 26, 8915-8927. DOI: 10.1109/TMM.2024.3383662
  • (2024) Free-Form Composition Networks for Egocentric Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology 34(10), 9967-9978. DOI: 10.1109/TCSVT.2024.3406546
  • (2024) Video Visual Relation Detection Based on Trajectory Fusion. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. DOI: 10.1109/IJCNN60899.2024.10650663
  • (2024) Online spatio-temporal action detection with adaptive sampling and hierarchical modulation. Multimedia Systems 30(6). DOI: 10.1007/s00530-024-01543-1
  • (2022) Attention Guided Relation Detection Approach for Video Visual Relation Detection. IEEE Transactions on Multimedia 24, 3896-3907. DOI: 10.1109/TMM.2021.3109430
  • (2022) Spatial-Temporal Action Localization With Hierarchical Self-Attention. IEEE Transactions on Multimedia 24, 625-639. DOI: 10.1109/TMM.2021.3056892
  • (2022) Stand-Alone Inter-Frame Attention in Video Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3182-3191. DOI: 10.1109/CVPR52688.2022.00319
  • (2021) Identity-aware Graph Memory Network for Action Detection. Proceedings of the 29th ACM International Conference on Multimedia, 3437-3445. DOI: 10.1145/3474085.3475503
