DOI: 10.1609/aaai.v33i01.33018303

I know the relationships: zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs

Published: 27 January 2019

Abstract

With the ever-growing number of action categories, zero-shot action recognition (ZSAR) has recently been pursued by automatically mining the underlying concepts (e.g., actions, attributes) in videos. However, most existing methods only exploit the visual cues of these concepts and ignore external knowledge that could model explicit relationships between them. In fact, humans have a remarkable ability to transfer knowledge learned from familiar classes to recognize unfamiliar classes. To narrow this knowledge gap between existing methods and humans, we propose an end-to-end ZSAR framework based on a structured knowledge graph, which jointly models the action-attribute, action-action, and attribute-attribute relationships. To effectively leverage the knowledge graph, we design a novel Two-Stream Graph Convolutional Network (TS-GCN) consisting of a classifier branch and an instance branch. Specifically, the classifier branch takes the semantic-embedding vectors of all the concepts as input and generates the classifiers for the action categories. The instance branch maps the attribute embeddings and scores of each video instance into an attribute-feature space. Finally, the generated classifiers are evaluated on the attribute features of each video, and a classification loss is adopted to optimize the whole network. In addition, a self-attention module is used to model the temporal information of videos. Extensive experimental results on three realistic action benchmarks, Olympic Sports, HMDB51, and UCF101, demonstrate the favorable performance of our proposed framework.
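
The abstract outlines how the two branches fit together; the following is a minimal, hypothetical PyTorch sketch of that structure, not the authors' implementation. The word-vector dimension, the node ordering (action nodes first), the mean pooling over attribute nodes, and every module name and size below are assumptions made only to illustrate the data flow.

```python
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph convolution, Kipf & Welling style: H' = ReLU(A_hat @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # a_hat: normalized adjacency over the graph nodes; h: node features.
        return torch.relu(self.linear(torch.matmul(a_hat, h)))


class TwoStreamGCNSketch(nn.Module):
    """Illustrative two-branch model: the classifier branch turns concept word
    embeddings into per-action classifiers; the instance branch maps per-video
    attribute embeddings and scores into the same attribute-feature space."""

    def __init__(self, word_dim=300, hidden=512, attr_dim=256, num_actions=51):
        super().__init__()
        self.num_actions = num_actions  # assumption: action nodes listed first
        self.cls_gcn1 = GCNLayer(word_dim, hidden)
        self.cls_gcn2 = GCNLayer(hidden, attr_dim)
        self.inst_gcn1 = GCNLayer(word_dim + 1, hidden)  # embedding + attribute score
        self.inst_gcn2 = GCNLayer(hidden, attr_dim)

    def forward(self, concept_embed, a_hat, attr_embed_scores, a_hat_attr):
        # concept_embed:     (N, word_dim)      embeddings of all actions + attributes
        # a_hat:             (N, N)             normalized knowledge-graph adjacency
        # attr_embed_scores: (B, M, word_dim+1) per-video attribute embeddings and scores
        # a_hat_attr:        (M, M)             adjacency restricted to attribute nodes
        classifiers = self.cls_gcn2(self.cls_gcn1(concept_embed, a_hat), a_hat)
        classifiers = classifiers[: self.num_actions]           # (num_actions, attr_dim)

        feats = self.inst_gcn2(self.inst_gcn1(attr_embed_scores, a_hat_attr), a_hat_attr)
        video_feat = feats.mean(dim=1)                           # (B, attr_dim)

        # Evaluate the generated classifiers on each video's attribute features.
        return video_feat @ classifiers.t()                      # (B, num_actions) logits
```

Under these assumptions, the returned logits would feed a standard cross-entropy loss over seen classes during training, matching the classification loss described above, while at test time the classifiers generated for unseen action nodes score videos directly.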





      Information

      Published In

      AAAI'19/IAAI'19/EAAI'19: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence
      January 2019
      10088 pages
      ISBN:978-1-57735-809-1

      Sponsors

      • Association for the Advancement of Artificial Intelligence

      Publisher

      AAAI Press

      Publication History

      Published: 27 January 2019

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Cited By

• (2024) DTS-TPT. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 1534–1542. DOI: 10.24963/ijcai.2024/170. Online publication date: 3-Aug-2024.
• (2024) Optimizing Video Selection LIMIT Queries with Commonsense Knowledge. Proceedings of the VLDB Endowment 17(7), 1751–1764. DOI: 10.14778/3654621.3654639. Online publication date: 1-Mar-2024.
• (2024) Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(10), 1–22. DOI: 10.1145/3663368. Online publication date: 12-Sep-2024.
• (2024) CoBjeason: Reasoning Covered Object in Image by Multi-Agent Collaboration Based on Informed Knowledge Graph. ACM Transactions on Knowledge Discovery from Data 18(5), 1–56. DOI: 10.1145/3643565. Online publication date: 28-Feb-2024.
• (2024) FZR: Enhancing Knowledge Transfer via Shared Factors Composition in Zero-Shot Relational Learning. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 497–507. DOI: 10.1145/3627673.3679770. Online publication date: 21-Oct-2024.
• (2024) ESC-ZSAR. Expert Systems with Applications: An International Journal 255(PD). DOI: 10.1016/j.eswa.2024.124786. Online publication date: 21-Nov-2024.
• (2024) EPK-CLIP. Expert Systems with Applications: An International Journal 252(PA). DOI: 10.1016/j.eswa.2024.124183. Online publication date: 24-Jul-2024.
• (2024) Research on quality assessment methods for cybersecurity knowledge graphs. Computers and Security 142(C). DOI: 10.1016/j.cose.2024.103848. Online publication date: 1-Jul-2024.
• (2024) MPTN. Computers in Biology and Medicine 168(C). DOI: 10.1016/j.compbiomed.2023.107800. Online publication date: 12-Apr-2024.
• (2024) SiamRCSC: Robust siamese network with channel and spatial constraints for visual object tracking. Multimedia Systems 30(6). DOI: 10.1007/s00530-024-01524-4. Online publication date: 23-Oct-2024.
