
Semantic Reasoning in Zero Example Video Event Retrieval

Published: 04 October 2017

Abstract

Searching digital video data for high-level events, such as a parade or a car accident, is challenging when the query is textual and lacks visual example images or videos. Current research in deep neural networks is highly beneficial for the retrieval of high-level events using visual examples, but without examples it is still hard to determine (1) which concepts are useful to pre-train (Vocabulary challenge) and (2) which pre-trained concept detectors are relevant for a certain unseen high-level event (Concept Selection challenge). In this article, we present our Semantic Event Retrieval System, which (1) shows the importance of high-level concepts in a vocabulary for the retrieval of complex and generic high-level events and (2) uses a novel concept selection method (i-w2v) based on semantic embeddings. Our experiments on the international TRECVID Multimedia Event Detection benchmark show that a diverse vocabulary including high-level concepts improves performance on the retrieval of high-level events in videos and that our novel method outperforms a knowledge-based concept selection method.
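To make the Concept Selection challenge concrete, the following minimal sketch ranks the concepts in a vocabulary by cosine similarity between their embedding and the embedding of a textual event query, then keeps the top few. It is not the i-w2v method, whose details the abstract does not give; it only illustrates the general idea of embedding-based concept selection. The three-dimensional vectors, the concept names, and the function names `cosine` and `select_concepts` are all hypothetical and chosen for illustration.

```python
import math

# Hypothetical toy embeddings (hand-made, 3 dimensions); a real system would
# load vectors from a trained word-embedding model instead.
TOY_EMBEDDINGS = {
    "parade": [0.9, 0.1, 0.2],
    "marching_band": [0.8, 0.2, 0.1],
    "crowd": [0.7, 0.3, 0.3],
    "car": [0.1, 0.9, 0.2],
    "kitchen": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_concepts(query, vocabulary, embeddings, top_k=2):
    """Return the top_k (concept, score) pairs most similar to the query."""
    query_vec = embeddings[query]
    scored = [(concept, cosine(query_vec, embeddings[concept]))
              for concept in vocabulary]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


vocabulary = ["marching_band", "crowd", "car", "kitchen"]
selected = select_concepts("parade", vocabulary, TOY_EMBEDDINGS)
for concept, score in selected:
    print(f"{concept}: {score:.3f}")
```

With these toy vectors, the two concepts closest to "parade" are "marching_band" and "crowd". In a retrieval setting, the detector scores of the selected concepts would then be combined to rank videos; how that combination is done is outside what the abstract describes.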


Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 4
November 2017
362 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3129737

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 October 2017
Accepted: 01 July 2017
Revised: 01 May 2017
Received: 01 July 2016
Published in TOMM Volume 13, Issue 4


Author Tags

  1. Content-based visual information retrieval
  2. multimedia event detection
  3. semantics
  4. zero shot

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Research Grants Council of the Hong Kong Special Administrative Region, China
