Abstract
Textually narrating the observed evidence that explains why a video clip is retrieved for an event remains a highly challenging problem. This paper explores the use of a commonsense ontology, namely ConceptNet, to generate short descriptions that recount the audio–visual evidence. The ontology is exploited as a knowledge engine that provides event-relevant common sense, expressed in terms of concepts and their relationships, for semantic understanding, context-based concept screening, and sentence synthesis. A principled way of exploiting the ontology, from extracting the event-relevant semantic network to forming syntactic parse trees, is outlined and discussed. Experimental results on two benchmark datasets (TRECVID MED and MediaEval) show the effectiveness of our approach. The findings offer insights into the usability of common sense for multimedia search, including the feasibility of inferring relevant concepts for event detection and the quality of the generated sentences in meeting human expectations.
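To make the first step of this pipeline concrete, the sketch below queries ConceptNet for concepts related to an event term and screens out weakly related ones. It is a minimal sketch, not the authors' implementation: it assumes the public ConceptNet 5 Web API at api.conceptnet.io (which postdates the toolkit used in the paper), and the relatedness threshold is an illustrative choice.

```python
# A minimal sketch, not the authors' implementation: retrieve concepts related
# to an event term from the public ConceptNet 5 Web API (an assumption; the
# paper predates this API) and keep only the strongly related ones.
import requests

API = "http://api.conceptnet.io"

def related_concepts(term, limit=30):
    """Fetch English concepts related to `term`, with relatedness weights."""
    uri = "/related/c/en/" + term.lower().replace(" ", "_")
    resp = requests.get(API + uri, params={"filter": "/c/en", "limit": limit})
    resp.raise_for_status()
    # Each entry looks like {"@id": "/c/en/cake", "weight": 0.61}.
    return [(e["@id"].rsplit("/", 1)[-1], e["weight"])
            for e in resp.json().get("related", [])]

def event_network(event_term, threshold=0.3):
    """Screen out weakly related concepts; `threshold` is illustrative."""
    return {c: w for c, w in related_concepts(event_term) if w >= threshold}

if __name__ == "__main__":
    # "Birthday party" is one of the TRECVID MED events discussed in the paper.
    for concept, weight in sorted(event_network("birthday party").items(),
                                  key=lambda kv: -kv[1]):
        print(f"{concept:24s} {weight:.2f}")
```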
Notes
Refer to http://vireo.cs.cityu.edu.hk/mer_demo/networks.html for twenty event networks generated for TRECVID MED 2012.
The suffixes -ing and -s are omitted in ConceptNet, which stores concepts in their base form.
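For illustration, a surface word such as "blowing" or "candles" must therefore be reduced to its base form before lookup. The snippet below uses NLTK's WordNet lemmatizer as a stand-in for whatever normalization the authors applied; the two-pass verb-then-noun lemmatization is an assumption of this sketch.

```python
# Sketch: map surface words to ConceptNet-style base-form concept names,
# since ConceptNet stores "blow" and "candle" rather than "blowing" or
# "candles". NLTK's lemmatizer is a stand-in for the paper's normalization.
from nltk.stem import WordNetLemmatizer  # requires: nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def to_concept(phrase):
    words = []
    for w in phrase.lower().split():
        w = lemmatizer.lemmatize(w, pos="v")  # strips -ing: "blowing" -> "blow"
        w = lemmatizer.lemmatize(w, pos="n")  # strips -s: "candles" -> "candle"
        words.append(w)
    return "_".join(words)  # ConceptNet URIs join multiword concepts with "_"

print(to_concept("blowing candles"))  # -> blow_candle
```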
Additional information
The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 120213).
Cite this article
Tan, CC., Ngo, CW. On the use of commonsense ontology for multimedia event recounting. Int J Multimed Info Retr 5, 73–88 (2016). https://doi.org/10.1007/s13735-015-0090-3