research-article

Semantic Reasoning in Zero Example Video Event Retrieval

Authors:

Maaike H. T. De Boer,

Klamer Schutte,

Wessel Kraaij

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 13, Issue 4

Article No.: 60, Pages 1 - 17

https://doi.org/10.1145/3131288

Published: 04 October 2017

Abstract

Searching digital video data for high-level events, such as a parade or a car accident, is challenging when the query is textual and lacks visual example images or videos. Current research in deep neural networks is highly beneficial for the retrieval of high-level events using visual examples, but without examples it is still hard to determine (1) which concepts are useful to pre-train (the Vocabulary challenge) and (2) which pre-trained concept detectors are relevant for a certain unseen high-level event (the Concept Selection challenge). In this article, we present our Semantic Event Retrieval System, which (1) shows the importance of high-level concepts in a vocabulary for the retrieval of complex and generic high-level events and (2) uses a novel concept selection method (i-w2v) based on semantic embeddings. Our experiments on the international TRECVID Multimedia Event Detection benchmark show that a diverse vocabulary including high-level concepts improves performance on the retrieval of high-level events in videos and that our novel method outperforms a knowledge-based concept selection method.
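
The abstract does not spell out how i-w2v works, so the following is only a generic sketch of embedding-based concept selection, not the authors' method. It ranks the concepts in a vocabulary by cosine similarity to a query embedding and keeps the best matches. The function names, the three-dimensional vectors, and the concept labels are all hypothetical; a real system would presumably load trained word embeddings instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_concepts(query_vec, concept_vecs, top_k=2):
    """Rank vocabulary concepts by similarity to the query and keep the top_k."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in concept_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings invented for illustration only; they are not real word vectors.
TOY_CONCEPTS = {
    "marching band": [0.9, 0.1, 0.0],
    "crowd": [0.7, 0.3, 0.1],
    "kitchen": [0.0, 0.1, 0.9],
}
TOY_QUERY = [1.0, 0.2, 0.0]  # stands in for the embedding of the textual query "parade"

selected = select_concepts(TOY_QUERY, TOY_CONCEPTS, top_k=2)
for name, score in selected:
    print(f"{name}: {score:.3f}")
```

In this toy setup the two concepts whose vectors point in roughly the same direction as the query are kept and the unrelated one is dropped; the selected detectors' scores on each video could then be combined to rank videos for the event.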

Index Terms

Semantic Reasoning in Zero Example Video Event Retrieval
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query representation
    2. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Video search

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 13, Issue 4

November 2017

362 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3129737

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 October 2017

Accepted: 01 July 2017

Revised: 01 May 2017

Received: 01 July 2016

Published in TOMM Volume 13, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

Research Grants Council of the Hong Kong Special Administrative Region, China
