
Learning multi-temporal-scale deep information for action recognition

Published in: Applied Intelligence

Abstract

Action recognition in video is widely applied in video indexing, intelligent surveillance, multimedia understanding, and other fields. A typical human action contains spatiotemporal information at various scales, and learning and fusing this multi-temporal-scale information makes action recognition more reliable in terms of recognition accuracy. To support this claim, in this paper we use Res3D, a 3D Convolutional Neural Network (CNN) architecture, to extract information at multiple temporal scales. At each temporal scale, we transfer the knowledge learned from RGB to 3-channel optical flow (OF) and learn information from both the RGB and OF fields. We also propose Parallel Pair Discriminant Correlation Analysis (PPDCA) to fuse the multi-temporal-scale information into an action representation of lower dimension. Experimental results show that, compared with the single-temporal-scale method, the proposed multi-temporal-scale method gains higher recognition accuracy and spends more time on feature extraction, but less time on classification owing to the lower-dimensional representation. Moreover, the proposed method achieves recognition performance comparable to that of state-of-the-art methods. The source code and 3D filter animations are available online: https://github.com/JerryYaoGl/multi-temporal-scale.
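The multi-temporal-scale idea described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's pipeline: the helper names are hypothetical, and a mean-pooling stand-in replaces the Res3D feature extractor and PPDCA fusion; only the structure (one feature per temporal scale, then fusion into a single representation) mirrors the text.

```python
import numpy as np

def sample_clip(video, clip_len):
    """Uniformly sample `clip_len` frames from a video of shape (T, H, W, C)."""
    t = video.shape[0]
    idx = np.linspace(0, t - 1, clip_len).round().astype(int)
    return video[idx]

def extract_feature(clip):
    """Stand-in for a 3D-CNN feature extractor (the paper uses Res3D);
    here we simply mean-pool the clip down to a fixed-length vector."""
    return clip.mean(axis=(0, 1, 2))  # shape (C,)

def multi_scale_representation(video, scales=(8, 16, 32)):
    """Extract one feature per temporal scale, then fuse by concatenation
    (the paper fuses with PPDCA to reduce dimensionality instead)."""
    feats = [extract_feature(sample_clip(video, s)) for s in scales]
    return np.concatenate(feats)

video = np.random.rand(64, 112, 112, 3)  # 64 RGB frames of 112x112
rep = multi_scale_representation(video)
print(rep.shape)  # (9,): 3 channels x 3 temporal scales
```

Concatenation grows linearly with the number of scales, which motivates the paper's use of a discriminant-correlation-based fusion to keep the final representation compact.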



References

  1. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324


  2. Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: Proceedings of the annual conference on neural information processing systems, pp 1097–1105

  3. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587

  4. Farabet C, Couprie C, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 35(8):1915–1929


  5. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

  6. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Li F (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732

  7. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Proceedings of the advances in neural information processing systems, pp 568–576

  8. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497

  9. Tran D, Ray J, Shou Z, Chang SF, Paluri M (2017) ConvNet architecture search for spatiotemporal feature learning. arXiv:1708.05038

  10. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  11. Haghighat M, Abdel-Mottaleb M, Alhalabi W (2016) Discriminant correlation analysis: real-time feature level fusion for multimodal biometric recognition. IEEE Trans Inf Foren Sec 11(9):1984–1996


  12. Lin Z, Jiang Z, Davis L (2009) Recognizing actions by shape-motion prototype trees. In: Proceedings of the IEEE international conference on computer vision, pp 444–451

  13. Efros A, Berg A, Mori G, Malik J (2003) Recognizing action at a distance. In: Proceedings of the IEEE international conference on computer vision, pp 726–733

  14. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558

  15. Peng X, Wang L, Wang X, Qiao Y (2016) Bag of visual words and fusion methods for action recognition. Comput Vis Image Underst 150:109–125


  16. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 886–893

  17. Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8

  18. Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: Proceedings of the European conference on computer vision, pp 428–441

  19. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8

  20. Yang M, Ji S, Xu W, Wang J (2009) Detecting human actions in surveillance videos. In: Proceedings of the TREC video retrieval evaluation workshop

  21. Ji S, Xu W, Yang M, Yu K (2010) 3D convolutional neural networks for human action recognition. In: Proceedings of the International conference on machine learning, pp 495–502

  22. Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: Proceedings of the International conference on human behavior understanding, pp 29–39

  23. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231


  24. Zhang B, Wang L, Wang Z, Qiao Y, Wang H (2016) Real-time action recognition with enhanced motion vector CNNs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2718–2726

  25. Gao Z, Hua G, Zhang D, Jojic N, Wang L (2017) ER3: A unified framework for event retrieval recognition and recounting. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  26. Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517


  27. Wang L, Xiong Y, Wang Z, Qiao Y (2015) Towards good practices for very deep two-stream convnets. arXiv:1507.02159

  28. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A, Li F (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252


  29. Zeiler M, Fergus R (2014) Visualizing and understanding convolutional networks. In: Proceedings of the European conference on computer vision, pp 818–833

  30. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the international conference on learning representation

  31. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. In: Proceedings of the British machine vision conference

  32. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

  33. Soomro K, Zamir A, Shah M (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report, University of Central Florida

  34. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: A large video database for human motion recognition. In: Proceedings of the IEEE international conference on computer vision

  35. Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605


  36. Zach C, Pock T, Bischof H (2007) A duality based approach for realtime TV-L1 optical flow. In: Proceedings of DAGM symposium on pattern recognition, pp 214–223

  37. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27


  38. Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4694–4702

  39. Donahue J, Hendricks L, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, Darrell T (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 39(4):677–691


  40. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314

  41. Cherian A, Fernando B, Harandi M, Gould S (2017) Generalized rank pooling for activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1581–1590

  42. Kar A, Rai N, Sikka K, Sharma G (2017) AdaScan: adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5699–5708

  43. Park E, Han X, Berg T, Berg A (2016) Combining multiple sources of knowledge in deep CNNs for action recognition. In: Proceedings of the IEEE winter conference on applications of computer vision, pp 177–186

  44. Wu Z, Wang X, Jiang Y, Ye H, Xue X (2018) Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: Proceedings of the ACM multimedia conference

  45. Li Y, Li W, Mahadevan V, Vasconcelos N (2016) VLAD3: encoding dynamics of deep features for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1951–1960

  46. Neverova N, Wolf C, Taylor GW, Nebout F (2014) Multi-scale deep learning for gesture detection and localization. In: Workshop of the European conference on computer vision, pp 474–490

  47. Jung M, Hwang J, Tani J (2014) Multiple spatio-temporal scales neural network for contextual visual recognition of human actions. In: Proceedings of the IEEE conferences on development and learning and epigenetic robotics, pp 235–241


Acknowledgements

This work was supported by Youth Innovation Promotion Association, Chinese Academy of Sciences (Grant No. 2016336).

Author information

Correspondence to Tao Lei.

About this article

Cite this article

Yao, G., Lei, T., Zhong, J. et al. Learning multi-temporal-scale deep information for action recognition. Appl Intell 49, 2017–2029 (2019). https://doi.org/10.1007/s10489-018-1347-3
