
Learning multi-temporal-scale deep information for action recognition

Published in: Applied Intelligence

Abstract

Action recognition in video is widely applied in video indexing, intelligent surveillance, multimedia understanding, and other fields. A typical human action contains spatiotemporal information at various scales, and learning and fusing this multi-temporal-scale information makes action recognition more reliable in terms of recognition accuracy. To support this claim, in this paper we use Res3D, a 3D Convolutional Neural Network (CNN) architecture, to extract information at multiple temporal scales. At each temporal scale, we transfer the knowledge learned from RGB to 3-channel optical flow (OF) and learn information from both the RGB and OF fields. We also propose Parallel Pair Discriminant Correlation Analysis (PPDCA) to fuse the multi-temporal-scale information into an action representation of lower dimension. Experimental results show that, compared with the single-temporal-scale method, the proposed multi-temporal-scale method gains higher recognition accuracy and spends more time on feature extraction, but less time on classification owing to the lower-dimensional representation. Moreover, the proposed method achieves recognition performance comparable to that of state-of-the-art methods. The source code and 3D filter animations are available online: https://github.com/JerryYaoGl/multi-temporal-scale.
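The multi-temporal-scale idea described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's pipeline: the helper names are hypothetical, and a mean-pooling stand-in replaces the Res3D feature extractor and PPDCA fusion; only the structure (one feature per temporal scale, then fusion into a single representation) mirrors the text.

```python
import numpy as np

def sample_clip(video, clip_len):
    """Uniformly sample `clip_len` frames from a video of shape (T, H, W, C)."""
    t = video.shape[0]
    idx = np.linspace(0, t - 1, clip_len).round().astype(int)
    return video[idx]

def extract_feature(clip):
    """Stand-in for a 3D-CNN feature extractor (the paper uses Res3D);
    here we simply mean-pool the clip down to a fixed-length vector."""
    return clip.mean(axis=(0, 1, 2))  # shape (C,)

def multi_scale_representation(video, scales=(8, 16, 32)):
    """Extract one feature per temporal scale, then fuse by concatenation
    (the paper fuses with PPDCA to reduce dimensionality instead)."""
    feats = [extract_feature(sample_clip(video, s)) for s in scales]
    return np.concatenate(feats)

video = np.random.rand(64, 112, 112, 3)  # 64 RGB frames of 112x112
rep = multi_scale_representation(video)
print(rep.shape)  # (9,): 3 channels x 3 temporal scales
```

Concatenation grows linearly with the number of scales, which motivates the paper's use of a discriminant-correlation-based fusion to keep the final representation compact.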



References

  1. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324


  2. Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: Proceedings of the annual conference on neural information processing systems, pp 1097–1105

  3. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587

  4. Farabet C, Couprie C, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 35(8):1915–1929


  5. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

  6. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Li F (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732

  7. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Proceedings of the advances in neural information processing systems, pp 568–576

  8. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 4489–4497

  9. Tran D, Ray J, Shou Z, Chang SF, Paluri M (2017) ConvNet architecture search for spatiotemporal feature learning. arXiv:1708.05038

  10. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  11. Haghighat M, Abdel-Mottaleb M, Alhalabi W (2016) Discriminant correlation analysis: real-time feature level fusion for multimodal biometric recognition. IEEE Trans Inf Foren Sec 11(9):1984–1996


  12. Lin Z, Jiang Z, Davis L (2009) Recognizing actions by shape-motion prototype trees. In: Proceedings of the IEEE international conference on computer vision, pp 444–451

  13. Efros A, Berg A, Mori G, Malik J (2003) Recognizing action at a distance. In: Proceedings of the IEEE international conference on computer vision, pp 726–733

  14. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558

  15. Peng X, Wang L, Wang X, Qiao Y (2016) Bag of visual words and fusion methods for action recognition. Comput Vis Image Underst 150:109–125


  16. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 886–893

  17. Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8

  18. Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. In: Proceedings of the European conference on computer vision, pp 428–441

  19. Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image categorization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–8

  20. Yang M, Ji S, Xu W, Wang J (2009) Detecting human actions in surveillance videos. In: Proceedings of the TREC video retrieval evaluation workshop

  21. Ji S, Xu W, Yang M, Yu K (2010) 3D convolutional neural networks for human action recognition. In: Proceedings of the International conference on machine learning, pp 495–502

  22. Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: Proceedings of the International conference on human behavior understanding, pp 29–39

  23. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231


  24. Zhang B, Wang L, Wang Z, Qiao Y, Wang H (2016) Real-time action recognition with enhanced motion vector CNNs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2718–2726

  25. Gao Z, Hua G, Zhang D, Jojic N, Wang L (2017) ER3: A unified framework for event retrieval recognition and recounting. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  26. Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. IEEE Trans Pattern Anal Mach Intell 40(6):1510–1517


  27. Wang L, Xiong Y, Wang Z, Qiao Y (2015) Towards good practices for very deep two-stream convnets. arXiv:1507.02159

  28. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg A, Li F (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252


  29. Zeiler M, Fergus R (2014) Visualizing and understanding convolutional networks. In: Proceedings of the European conference on computer vision, pp 818–833

  30. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the international conference on learning representation

  31. Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. In: Proceedings of the British machine vision conference

  32. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

  33. Soomro K, Zamir A, Shah M (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report, University of Central Florida

  34. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: A large video database for human motion recognition. In: Proceedings of the IEEE international conference on computer vision

  35. Maaten LVD, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605


  36. Zach C, Pock T, Bischof H (2007) A duality based approach for realtime TV-L1 optical flow. In: Proceedings of DAGM symposium on pattern recognition, pp 214–223

  37. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27


  38. Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4694–4702

  39. Donahue J, Hendricks L, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, Darrell T (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 39(4):677–691


  40. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4305–4314

  41. Cherian A, Fernando B, Harandi M, Gould S (2017) Generalized rank pooling for activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1581–1590

  42. Kar A, Rai N, Sikka K, Sharma G (2017) AdaScan: adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5699–5708

  43. Park E, Han X, Berg T, Berg A (2016) Combining multiple sources of knowledge in deep CNNs for action recognition. In: Proceedings of the IEEE winter conference on applications of computer vision, pp 177–186

  44. Wu Z, Wang X, Jiang Y, Ye H, Xue X (2018) Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: Proceedings of the ACM multimedia conference

  45. Li Y, Li W, Mahadevan V, Vasconcelos N (2016) VLAD3: encoding dynamics of deep features for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1951–1960

  46. Neverova N, Wolf C, Taylor GW, Nebout F (2014) Multi-scale deep learning for gesture detection and localization. In: Workshop of the European conference on computer vision, pp 474–490

  47. Jung M, Hwang J, Tani J (2014) Multiple spatio-temporal scales neural network for contextual visual recognition of human actions. In: Proceedings of the IEEE conferences on development and learning and epigenetic robotics, pp 235–241


Acknowledgements

This work was supported by Youth Innovation Promotion Association, Chinese Academy of Sciences (Grant No. 2016336).

Author information

Correspondence to Tao Lei.

About this article

Cite this article

Yao, G., Lei, T., Zhong, J. et al. Learning multi-temporal-scale deep information for action recognition. Appl Intell 49, 2017–2029 (2019). https://doi.org/10.1007/s10489-018-1347-3
