Abstract
Shapes convey meaning. Language is efficient in expressing and structuring meaning. The main thesis of this chapter is that shape recognition can be improved by integrating shape with linguistic information. The chapter broadens the concept of shape to visual shapes that carry both geometric and optical information, and explores ways in which additional linguistic information may aid shape recognition. Towards this goal, it briefly describes some shape categories that have the potential for better recognition via language, with emphasis on the gestures and moving shapes of sign language, as well as on cross-modal relations between vision and language in videos. It also draws inspiration from psychological studies that explore connections between gestures and human languages. It then focuses on the broad class of multimodal gestures that combine spatio-temporal visual shapes with audio information. In this area, it reviews an approach that significantly improves multimodal gesture recognition by fusing 3D shape information from the motion-position of the gesturing hands/arms and spatio-temporal handshapes in the color and depth visual channels with audio information in the form of acoustically recognized sequences of gesture words.
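As a concrete illustration of the fusion idea in the last sentence, the sketch below combines per-modality scores over a shared list of candidate gesture-word hypotheses and selects the best-scoring one, in the spirit of the N-best rescoring reviewed in the chapter. It is a minimal, hypothetical example, not the authors' implementation: the function names, modality weights, and toy scores are all assumptions.

```python
# Minimal sketch of late multimodal fusion by rescoring a shared list of
# gesture-word hypotheses. All names, weights, and scores are hypothetical;
# the reviewed system uses trained single-modality recognizers (e.g. HMMs)
# whose scores are combined to rescore the hypotheses.

from typing import Dict, List, Tuple

def fuse_nbest(
    nbest: List[Tuple[str, ...]],                     # candidate gesture-word sequences
    scores: Dict[str, Dict[Tuple[str, ...], float]],  # modality -> hypothesis -> log-score
    weights: Dict[str, float],                        # modality reliability weights
) -> Tuple[str, ...]:
    """Return the hypothesis with the highest weighted sum of log-scores."""
    def combined(hyp: Tuple[str, ...]) -> float:
        return sum(weights[m] * scores[m][hyp] for m in weights)
    return max(nbest, key=combined)

# Toy usage: the audio stream proposes the hypotheses; the skeletal
# (motion-position) and handshape (color+depth) streams rescore them.
nbest = [("OK",), ("BASTA",)]
scores = {
    "audio":     {("OK",): -10.0, ("BASTA",): -11.0},
    "skeleton":  {("OK",): -14.0, ("BASTA",): -9.0},
    "handshape": {("OK",): -13.0, ("BASTA",): -8.5},
}
weights = {"audio": 1.0, "skeleton": 0.7, "handshape": 0.5}
print(fuse_nbest(nbest, scores, weights))  # ('BASTA',)
```

The design point is that an acoustically strong but visually implausible hypothesis can be overturned once the visual streams contribute their evidence, which is how fusing the modalities improves over any single recognizer.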
Notes
- 1. In the PDTS system, D is a “hold” of shorter duration than P, S is a “movement” without acceleration, and T is a more abrupt motion.
Acknowledgements
We wish to thank Niki Efthymiou and Nancy Zlatintsi at NTUA CVSP Lab for Fig. 15.8 and for discussions related to Sect. 15.3.2. This research work was supported by the project COGNIMUSE, which is implemented under the ARISTEIA Action of the Operational Program Education and Lifelong Learning and is co-funded by the European Social Fund and Greek National Resources. It was also partially supported by the European Union under the project MOBOT with grant FP7-ICT-2011-9 2.1 – 600796.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Maragos, P., Pitsikalis, V., Katsamanis, A., Pavlakos, G., Theodorakis, S. (2016). On Shape Recognition and Language. In: Breuß, M., Bruckstein, A., Maragos, P., Wuhrer, S. (eds) Perspectives in Shape Analysis. Mathematics and Visualization. Springer, Cham. https://doi.org/10.1007/978-3-319-24726-7_15
DOI: https://doi.org/10.1007/978-3-319-24726-7_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24724-3
Online ISBN: 978-3-319-24726-7
eBook Packages: Mathematics and Statistics (R0)