Abstract
The use of animated talking agents is a novel feature of many multimodal spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. However, understanding the interactions between visual expressions, dialogue functions and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is closely related to the speech acoustics, while there are other articulatory movements affecting speech acoustics that are not visible on the outside of the face. Many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. This chapter looks into the communicative function of the animated talking agent, and its effect on intelligibility and the flow of the dialogue.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Beskow, J., Granström, B., House, D. (2007). Analysis and Synthesis of Multimodal Verbal and Non-verbal Interaction for Animated Interface Agents. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (eds) Verbal and Nonverbal Communication Behaviours. Lecture Notes in Computer Science(), vol 4775. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76442-7_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76441-0
Online ISBN: 978-3-540-76442-7