Abstract
Multimedia Event Detection (MED) is the task of identifying videos in which a certain event occurs. This paper addresses two problems in MED: the weakly supervised setting and unclear event structure. The first arises because annotating the association of each shot with an event is laborious and subjective, so training videos are only loosely annotated as to whether or not they contain the event; it is unknown which shots are relevant or irrelevant to the event. The second problem is the difficulty of assuming the event structure in advance, owing to arbitrary camera work and editing techniques. To tackle these problems, we propose a method using a Hidden Conditional Random Field (HCRF), a probabilistic discriminative classifier with a set of hidden states. We consider that the weakly supervised setting can be handled by using hidden states as an intermediate layer that discriminates between shots relevant and irrelevant to the event. In addition, the unclear structure of an event can be exposed by the features of each hidden state and its relations to the other states. Based on this idea, we optimise the hidden states and their relations so as to distinguish training videos containing the event from the others. Furthermore, to exploit the full potential of HCRFs, we establish approaches for training video preparation, parameter initialisation and fusion of multiple HCRFs. Experimental results on TRECVID video data validate the effectiveness of our method.
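For readers unfamiliar with the model, the generic HCRF formulation (a sketch of the standard model, not necessarily the exact parameterisation used in this paper) defines the conditional probability of an event label \(y\) given a video \(\varvec{x}\) by marginalising a potential function \(\Psi\) over all assignments of the hidden state sequence \(\varvec{h}\):

\[
P(y \mid \varvec{x}; \varvec{\theta }) = \frac{\sum _{\varvec{h}} \exp \Psi (y, \varvec{h}, \varvec{x}; \varvec{\theta })}{\sum _{y'} \sum _{\varvec{h}} \exp \Psi (y', \varvec{h}, \varvec{x}; \varvec{\theta })},
\]

where \(\varvec{\theta }\) collects the model parameters. Because \(\varvec{h}\) is summed out, only the video-level label \(y\) is needed for training, which is what makes the model suitable for the weakly supervised setting described above.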
Notes
It is not reasonable to initialise \(\varvec{\theta }_\mathrm{weight}(h_{i})\) as the centre of the \(i\)th cluster because of the difference in value ranges: while \(\varvec{\theta }_\mathrm{weight}(h_{i})\) takes both positive and negative values, the cluster centre cannot be negative, because concept detection scores lie between \(0\) and \(1\).
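The range mismatch can be illustrated with a minimal sketch, assuming a hypothetical matrix of concept detection scores and a simple k-means clustering of the shots: since each centre is an average of score vectors in \([0, 1]\), it can never be negative.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical concept detection scores: 100 shots x 5 concepts,
# each score lies between 0 and 1
scores = rng.random((100, 5))

def kmeans(X, k, iters=20):
    """Plain k-means over score vectors (illustration only)."""
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each shot to its nearest centre
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        # recompute each centre as the mean of its cluster's scores
        centres = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return centres

centres = kmeans(scores, 2)
# centres are averages of values in [0, 1], hence never negative --
# unlike the weight parameters, which must also take negative values
assert (centres >= 0).all() and (centres <= 1).all()
```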
We also tested PCA to make the dimensions (concepts) independent of each other, and standardisation to transform each dimension to zero mean and unit variance. However, neither of them worked well. We consider that the detection scores for each concept are appropriately biased by its detector, so editing their distribution offers no improvement.
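The per-dimension standardisation tested in this note can be sketched as follows (the score matrix is a synthetic placeholder); note how it discards the detector-specific bias of each concept's score distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical concept detection scores in [0, 1]: 100 shots x 5 concepts
scores = rng.random((100, 5))

# per-concept standardisation: zero mean, unit variance for each dimension
standardised = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# each dimension now has (approximately) mean 0 and variance 1,
# but the original, detector-induced location of each concept's scores is lost
assert np.allclose(standardised.mean(axis=0), 0.0)
assert np.allclose(standardised.var(axis=0), 1.0)
```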
Acknowledgments
The research work by Kimiaki Shirahama leading to this article has been funded by the Postdoctoral Fellowship for Research Abroad of the Japan Society for the Promotion of Science (JSPS). This work was also supported in part by JSPS through a Grant-in-Aid for Scientific Research (B): KAKENHI (26280040).
Cite this article
Shirahama, K., Grzegorzek, M. & Uehara, K. Weakly supervised detection of video events using hidden conditional random fields. Int J Multimed Info Retr 4, 17–32 (2015). https://doi.org/10.1007/s13735-014-0068-6