Abstract
Human Interaction Recognition (HIR) in uncontrolled TV video material is a very challenging problem because of the huge intra-class variability of the classes (due to large differences in the way actions are performed, lighting conditions and camera viewpoints, amongst others) as well as the small inter-class variability (e.g., the visual difference between hug and kiss is very subtle). Most previous works have focused only on visual information (i.e., the image signal), thus missing an important source of information present in human interactions: the audio. So far, such approaches have not proven discriminative enough. This work proposes the use of Audio-Visual Bag of Words (AVBOW) as a more powerful mechanism to approach the HIR problem than the traditional Visual Bag of Words (VBOW). We show in this paper that the combined use of video and audio information yields better classification results than video alone. Our approach has been validated on the challenging TVHID dataset, showing that the proposed AVBOW provides statistically significant improvements over the VBOW employed in the related literature.
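The core AVBOW idea described above can be sketched in a few lines: local visual descriptors (e.g., from spatio-temporal interest points) and local audio descriptors (e.g., MFCC frames) are each quantized against their own codebook, and the two resulting bag-of-words histograms are concatenated into a single audio-visual representation that is then fed to a classifier. The function names and dimensions below are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook (k-means centroids)
    and return an L1-normalized bag-of-words histogram."""
    # Squared Euclidean distance from each descriptor to each codeword,
    # shape (n_descriptors, n_words).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)                 # hard assignment
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def avbow_descriptor(visual_desc, visual_codebook, audio_desc, audio_codebook):
    """Early fusion: concatenate the visual and audio BoW histograms
    into one audio-visual descriptor for the whole clip."""
    return np.concatenate([
        bow_histogram(visual_desc, visual_codebook),
        bow_histogram(audio_desc, audio_codebook),
    ])

# Illustrative sizes only: 72-D visual descriptors, 13-D audio (MFCC-like)
# descriptors, codebooks of 100 and 30 words respectively.
rng = np.random.default_rng(0)
clip = avbow_descriptor(rng.normal(size=(50, 72)), rng.normal(size=(100, 72)),
                        rng.normal(size=(40, 13)), rng.normal(size=(30, 13)))
```

The concatenated vector (here of length 100 + 30) sums to 2, since each modality's half is L1-normalized independently; a kernel-based classifier such as an SVM would then be trained on these vectors.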
Acknowledgments
This research was partially supported by the Spanish Ministry of Economy and Competitiveness under projects P10-TIC-6762, TIN2010-15137, TIN2012-32952 and BROCA, and the European Regional Development Fund (FEDER).
Cite this article
Marín-Jiménez, M.J., Muñoz-Salinas, R., Yeguas-Bolivar, E. et al. Human interaction categorization by using audio-visual cues. Machine Vision and Applications 25, 71–84 (2014). https://doi.org/10.1007/s00138-013-0521-1