DOI: 10.1609/aaai.v33i01.33018303

I know the relationships: zero-shot action recognition via two-stream graph convolutional networks and knowledge graphs

Published: 27 January 2019

Abstract

With the ever-growing number of action categories, zero-shot action recognition (ZSAR) has recently been pursued by automatically mining the underlying concepts (e.g., actions, attributes) in videos. However, most existing methods only exploit the visual cues of these concepts and ignore external knowledge that could model explicit relationships between them. In fact, humans have a remarkable ability to transfer knowledge learned from familiar classes to recognize unfamiliar classes. To narrow this knowledge gap between existing methods and humans, we propose an end-to-end ZSAR framework based on a structured knowledge graph, which jointly models the action-attribute, action-action, and attribute-attribute relationships. To effectively leverage the knowledge graph, we design a novel Two-Stream Graph Convolutional Network (TS-GCN) consisting of a classifier branch and an instance branch. Specifically, the classifier branch takes the semantic-embedding vectors of all the concepts as input and generates the classifiers for the action categories. The instance branch maps the attribute embeddings and scores of each video instance into an attribute-feature space. Finally, the generated classifiers are evaluated on the attribute features of each video, and a classification loss is adopted to optimize the whole network. In addition, a self-attention module is used to model the temporal information of videos. Extensive experimental results on three realistic action benchmarks, Olympic Sports, HMDB51, and UCF101, demonstrate the favorable performance of our proposed framework.
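
The abstract outlines how the two branches fit together; the following is a minimal, hypothetical PyTorch sketch of that structure, not the authors' implementation. The word-vector dimension, the node ordering (action nodes first), the mean pooling over attribute nodes, and every module name and size below are assumptions made only to illustrate the data flow.

```python
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph convolution, Kipf & Welling style: H' = ReLU(A_hat @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # a_hat: normalized adjacency over the graph nodes; h: node features.
        return torch.relu(self.linear(torch.matmul(a_hat, h)))


class TwoStreamGCNSketch(nn.Module):
    """Illustrative two-branch model: the classifier branch turns concept word
    embeddings into per-action classifiers; the instance branch maps per-video
    attribute embeddings and scores into the same attribute-feature space."""

    def __init__(self, word_dim=300, hidden=512, attr_dim=256, num_actions=51):
        super().__init__()
        self.num_actions = num_actions  # assumption: action nodes listed first
        self.cls_gcn1 = GCNLayer(word_dim, hidden)
        self.cls_gcn2 = GCNLayer(hidden, attr_dim)
        self.inst_gcn1 = GCNLayer(word_dim + 1, hidden)  # embedding + attribute score
        self.inst_gcn2 = GCNLayer(hidden, attr_dim)

    def forward(self, concept_embed, a_hat, attr_embed_scores, a_hat_attr):
        # concept_embed:     (N, word_dim)      embeddings of all actions + attributes
        # a_hat:             (N, N)             normalized knowledge-graph adjacency
        # attr_embed_scores: (B, M, word_dim+1) per-video attribute embeddings and scores
        # a_hat_attr:        (M, M)             adjacency restricted to attribute nodes
        classifiers = self.cls_gcn2(self.cls_gcn1(concept_embed, a_hat), a_hat)
        classifiers = classifiers[: self.num_actions]           # (num_actions, attr_dim)

        feats = self.inst_gcn2(self.inst_gcn1(attr_embed_scores, a_hat_attr), a_hat_attr)
        video_feat = feats.mean(dim=1)                           # (B, attr_dim)

        # Evaluate the generated classifiers on each video's attribute features.
        return video_feat @ classifiers.t()                      # (B, num_actions) logits
```

Under these assumptions, the returned logits would feed a standard cross-entropy loss over seen classes during training, matching the classification loss described above, while at test time the classifiers generated for unseen action nodes score videos directly.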





      Information

      Published In

      AAAI'19/IAAI'19/EAAI'19: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence
      January 2019
      10088 pages
      ISBN:978-1-57735-809-1

      Sponsors

      • Association for the Advancement of Artificial Intelligence

      Publisher

      AAAI Press

      Publication History

      Published: 27 January 2019

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Cited By

• (2024) DTS-TPT. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 1534–1542. DOI: 10.24963/ijcai.2024/170. Online publication date: 3-Aug-2024.
• (2024) Optimizing Video Selection LIMIT Queries with Commonsense Knowledge. Proceedings of the VLDB Endowment 17(7), 1751–1764. DOI: 10.14778/3654621.3654639. Online publication date: 1-Mar-2024.
• (2024) Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications 20(10), 1–22. DOI: 10.1145/3663368. Online publication date: 12-Sep-2024.
• (2024) CoBjeason: Reasoning Covered Object in Image by Multi-Agent Collaboration Based on Informed Knowledge Graph. ACM Transactions on Knowledge Discovery from Data 18(5), 1–56. DOI: 10.1145/3643565. Online publication date: 28-Feb-2024.
• (2024) FZR: Enhancing Knowledge Transfer via Shared Factors Composition in Zero-Shot Relational Learning. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 497–507. DOI: 10.1145/3627673.3679770. Online publication date: 21-Oct-2024.
• (2024) ESC-ZSAR. Expert Systems with Applications: An International Journal 255(PD). DOI: 10.1016/j.eswa.2024.124786. Online publication date: 21-Nov-2024.
• (2024) EPK-CLIP. Expert Systems with Applications: An International Journal 252(PA). DOI: 10.1016/j.eswa.2024.124183. Online publication date: 24-Jul-2024.
• (2024) Research on quality assessment methods for cybersecurity knowledge graphs. Computers and Security 142(C). DOI: 10.1016/j.cose.2024.103848. Online publication date: 1-Jul-2024.
• (2024) MPTN. Computers in Biology and Medicine 168(C). DOI: 10.1016/j.compbiomed.2023.107800. Online publication date: 12-Apr-2024.
• (2024) SiamRCSC: Robust siamese network with channel and spatial constraints for visual object tracking. Multimedia Systems 30(6). DOI: 10.1007/s00530-024-01524-4. Online publication date: 23-Oct-2024.
