
Action Recognition in Untrimmed Videos with Composite Self-attention Two-Stream Framework

  • Conference paper
  • First Online:
Pattern Recognition (ACPR 2019)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12047)

Included in the following conference series: Asian Conference on Pattern Recognition (ACPR)

Abstract

With the rapid development of deep learning, action recognition in video has produced many important research results. One problem in action recognition, Zero-Shot Action Recognition (ZSAR), has recently attracted considerable attention: it requires classifying new categories for which no positive examples are available. Another difficulty is that untrimmed data can seriously degrade model performance. We propose a composite two-stream framework built on a pre-trained model. The framework consists of a classifier branch and a composite feature branch, each of which adopts a graph network model, effectively improving the framework's feature extraction and reasoning ability. In the composite feature branch, a 3-channel self-attention module is constructed to weight each frame of the video and give more attention to key frames. Each self-attention channel outputs a set of attention weights, represented as a one-dimensional vector, that focuses on a particular stage of the video, so the three channels together can infer key frames from multiple aspects. The output weight vectors form an attention matrix, which effectively enhances the attention paid to key frames that are strongly correlated with the action. The model can also perform action recognition under zero-shot conditions, and it handles untrimmed video data well. Experimental results on the relevant datasets confirm the validity of our model.
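To make the 3-channel self-attention idea concrete, the sketch below shows one plausible reading of the mechanism the abstract describes: each channel scores every per-frame feature independently, each channel's scores are normalised over frames into a one-dimensional weight vector, and stacking the vectors gives the attention matrix over frames. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; the feature dimension, the scoring head (`Linear -> Tanh -> Linear`), and the names `MultiChannelFrameAttention` and `num_channels` are assumptions of ours.

```python
import torch
import torch.nn as nn

class MultiChannelFrameAttention(nn.Module):
    """Sketch of a multi-channel frame-level self-attention module.

    Each channel has its own small scoring head, so different channels
    can focus on different stages (aspects) of the action. Stacking the
    per-channel weight vectors yields an attention matrix over frames.
    """

    def __init__(self, feat_dim: int = 1024, num_channels: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(), nn.Linear(128, 1))
            for _ in range(num_channels)
        )

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_frames, feat_dim) per-frame features,
        # e.g. extracted by a pre-trained two-stream backbone.
        scores = torch.cat([h(frames) for h in self.heads], dim=-1)  # (B, T, C)
        attn = torch.softmax(scores, dim=1)   # normalise each channel over frames
        attn = attn.transpose(1, 2)           # (B, C, T) attention matrix
        weighted = torch.bmm(attn, frames)    # (B, C, feat_dim) per-channel summaries
        return attn, weighted

# Usage: weight 32 frame features of dimension 1024 with 3 attention channels.
module = MultiChannelFrameAttention(feat_dim=1024, num_channels=3)
attn_matrix, channel_feats = module(torch.randn(2, 32, 1024))
print(attn_matrix.shape, channel_feats.shape)  # (2, 3, 32) and (2, 3, 1024)
```

One design point worth noting: normalising over the frame axis (rather than the channel axis) is what lets each channel act as a distribution over frames, so a channel can sharply emphasise the key frames of one stage of the action.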



Acknowledgments

We are very grateful to DeepBlue Technology (Shanghai) Co., Ltd. and the DeepBlue Academy of Sciences for their support. We also thank the Equipment Pre-Research Project (No. 31511060502) for its support, and Dr. Dongdong Zhang of the DeepBlue Academy of Sciences.

Author information


Corresponding author

Correspondence to Dong Cao.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Cao, D., Xu, L., Chen, H. (2020). Action Recognition in Untrimmed Videos with Composite Self-attention Two-Stream Framework. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W. (eds) Pattern Recognition. ACPR 2019. Lecture Notes in Computer Science, vol 12047. Springer, Cham. https://doi.org/10.1007/978-3-030-41299-9_3


  • DOI: https://doi.org/10.1007/978-3-030-41299-9_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41298-2

  • Online ISBN: 978-3-030-41299-9

  • eBook Packages: Computer Science, Computer Science (R0)
