Abstract
Multimedia Event Detection (MED) is the task of identifying videos in which a certain event occurs. This paper addresses two problems in MED: the weakly supervised setting and unclear event structure. The first arises because annotating the association of each shot with an event is laborious and subjective, so training videos are only loosely annotated as to whether or not they contain the event; it is unknown which shots are relevant or irrelevant to the event. The second problem is the difficulty of assuming the event structure in advance, owing to arbitrary camera work and editing techniques. To tackle these problems, we propose a method using a Hidden Conditional Random Field (HCRF), a probabilistic discriminative classifier with a set of hidden states. We consider that the weakly supervised setting can be handled by using hidden states as an intermediate layer that discriminates between shots relevant and irrelevant to the event. In addition, the unclear structure of an event can be exposed by the features of each hidden state and its relations to the other states. Based on this idea, we optimise the hidden states and their relations so as to distinguish training videos containing the event from the others. Furthermore, to exploit the full potential of HCRFs, we establish approaches for training video preparation, parameter initialisation and fusion of multiple HCRFs. Experimental results on TRECVID video data validate the effectiveness of our method.
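For readers unfamiliar with the model, the generic HCRF formulation (a sketch of the standard model, not necessarily the exact parameterisation used in this paper) defines the conditional probability of an event label \(y\) given a video \(\varvec{x}\) by marginalising a potential function \(\Psi\) over all assignments of the hidden state sequence \(\varvec{h}\):

\[
P(y \mid \varvec{x}; \varvec{\theta }) = \frac{\sum _{\varvec{h}} \exp \Psi (y, \varvec{h}, \varvec{x}; \varvec{\theta })}{\sum _{y'} \sum _{\varvec{h}} \exp \Psi (y', \varvec{h}, \varvec{x}; \varvec{\theta })},
\]

where \(\varvec{\theta }\) collects the model parameters. Because \(\varvec{h}\) is summed out, only the video-level label \(y\) is needed for training, which is what makes the model suitable for the weakly supervised setting described above.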
Notes
It is not reasonable to initialise \(\varvec{\theta }_\mathrm{weight}(h_{i})\) as the centre of the \(i\)th cluster because of the difference in value ranges: while \(\varvec{\theta }_\mathrm{weight}(h_{i})\) takes both positive and negative values, the cluster centre cannot be negative, because concept detection scores lie between \(0\) and \(1\).
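The range mismatch can be illustrated with a minimal sketch, assuming a hypothetical matrix of concept detection scores and a simple k-means clustering of the shots: since each centre is an average of score vectors in \([0, 1]\), it can never be negative.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical concept detection scores: 100 shots x 5 concepts,
# each score lies between 0 and 1
scores = rng.random((100, 5))

def kmeans(X, k, iters=20):
    """Plain k-means over score vectors (illustration only)."""
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each shot to its nearest centre
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        # recompute each centre as the mean of its cluster's scores
        centres = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return centres

centres = kmeans(scores, 2)
# centres are averages of values in [0, 1], hence never negative --
# unlike the weight parameters, which must also take negative values
assert (centres >= 0).all() and (centres <= 1).all()
```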
We also tested PCA to make the dimensions (concepts) independent of each other, and standardisation to transform each dimension to zero mean and unit variance. However, neither of them worked well. We consider that the detection scores for each concept are appropriately biased by its detector, so editing their distribution offers no improvement.
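The per-dimension standardisation tested in this note can be sketched as follows (the score matrix is a synthetic placeholder); note how it discards the detector-specific bias of each concept's score distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical concept detection scores in [0, 1]: 100 shots x 5 concepts
scores = rng.random((100, 5))

# per-concept standardisation: zero mean, unit variance for each dimension
standardised = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# each dimension now has (approximately) mean 0 and variance 1,
# but the original, detector-induced location of each concept's scores is lost
assert np.allclose(standardised.mean(axis=0), 0.0)
assert np.allclose(standardised.var(axis=0), 1.0)
```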
Acknowledgments
The research work by Kimiaki Shirahama leading to this article has been funded by the Postdoctoral Fellowship for Research Abroad of the Japan Society for the Promotion of Science (JSPS). This work was also supported in part by JSPS through a Grant-in-Aid for Scientific Research (B): KAKENHI (26280040).
Cite this article
Shirahama, K., Grzegorzek, M. & Uehara, K. Weakly supervised detection of video events using hidden conditional random fields. Int J Multimed Info Retr 4, 17–32 (2015). https://doi.org/10.1007/s13735-014-0068-6