Abstract
Human Interaction Recognition (HIR) in uncontrolled TV video material is a very challenging problem because of the huge intra-class variability of the classes (due to large differences in the way actions are performed, lighting conditions and camera viewpoints, amongst others) as well as the small inter-class variability (e.g., the visual difference between hug and kiss is very subtle). Most previous works have focused only on visual information (i.e., the image signal), thus missing an important source of information present in human interactions: the audio. So far, such approaches have not proven discriminative enough. This work proposes the use of Audio-Visual Bag of Words (AVBOW) as a more powerful mechanism to approach the HIR problem than the traditional Visual Bag of Words (VBOW). We show in this paper that the combined use of video and audio information yields better classification results than video alone. Our approach has been validated on the challenging TVHID dataset, showing that the proposed AVBOW provides statistically significant improvements over the VBOW employed in the related literature.
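The core AVBOW idea described above can be sketched in a few lines: local visual descriptors (e.g., from spatio-temporal interest points) and local audio descriptors (e.g., MFCC frames) are each quantized against their own codebook, and the two resulting bag-of-words histograms are concatenated into a single audio-visual representation that is then fed to a classifier. The function names and dimensions below are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook (k-means centroids)
    and return an L1-normalized bag-of-words histogram."""
    # Squared Euclidean distance from each descriptor to each codeword,
    # shape (n_descriptors, n_words).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)                 # hard assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def avbow_descriptor(visual_desc, visual_codebook, audio_desc, audio_codebook):
    """Early fusion: concatenate the visual and audio BoW histograms
    into one audio-visual descriptor for the whole clip."""
    return np.concatenate([
        bow_histogram(visual_desc, visual_codebook),
        bow_histogram(audio_desc, audio_codebook),
    ])

# Illustrative sizes only: 72-D visual descriptors, 13-D audio (MFCC-like)
# descriptors, codebooks of 100 and 30 words respectively.
rng = np.random.default_rng(0)
clip = avbow_descriptor(rng.normal(size=(50, 72)), rng.normal(size=(100, 72)),
                        rng.normal(size=(40, 13)), rng.normal(size=(30, 13)))
```

The concatenated vector (here of length 100 + 30) sums to 2, since each modality's half is L1-normalized independently; a kernel-based classifier such as an SVM would then be trained on these vectors.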
Acknowledgments
This research was partially supported by the Spanish Ministry of Economy and Competitiveness under projects P10-TIC-6762, TIN2010-15137, TIN2012-32952 and BROCA, and the European Regional Development Fund (FEDER).
Cite this article
Marín-Jiménez, M.J., Muñoz-Salinas, R., Yeguas-Bolivar, E. et al. Human interaction categorization by using audio-visual cues. Machine Vision and Applications 25, 71–84 (2014). https://doi.org/10.1007/s00138-013-0521-1