DOI: 10.1145/3570945.3607289
Research article · Open access

Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters

Published: 22 December 2023

Abstract

Engaging embodied conversational agents need to generate expressive behavior in order to be believable in social interactions. We present a system that can generate spontaneous speech with supporting lip movements. The neural conversational TTS voice is trained on a multi-style speech corpus that has been prosodically tagged (pitch and speaking rate) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm in which articulatory effort can be adjusted. The facial animation is driven by time-stamped phonemes and prominence estimates from the synthesised speech waveform, which modulate the lip and jaw movements accordingly. In objective evaluations we show that the system is able to generate speech and facial animation that vary in articulatory effort. In subjective evaluations we compare our conversational TTS system's capability to deliver jokes with that of a commercial TTS; both systems succeeded equally well.
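The abstract only sketches the animation algorithm, but its central idea (scaling lip and jaw articulation with a global effort control and per-phoneme prominence estimates from the synthesised waveform) can be illustrated in code. The following Python is a minimal, hypothetical sketch, not the authors' implementation: the Phone record, the blending formula in articulation_scale, and the keyframe format are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Phone:
    symbol: str        # phoneme label from the TTS alignment (hypothetical format)
    start: float       # onset time in seconds
    end: float         # offset time in seconds
    prominence: float  # acoustic prominence estimate, normalised to [0, 1]

def articulation_scale(phone: Phone, effort: float) -> float:
    """Map a global effort control in [0, 1] and a per-phone prominence
    estimate to a viseme amplitude in [0, 1]. Prominent phones are pushed
    toward full (hyper-)articulation; weak phones toward a reduced pose.
    The blending formula is an assumption, not taken from the paper."""
    base = 0.5 + 0.5 * effort                 # global articulatory effort in [0.5, 1.0]
    return min(1.0, base * (0.5 + phone.prominence))

def viseme_keyframes(phones: list[Phone], effort: float) -> list[tuple[float, str, float]]:
    """One (time, viseme, amplitude) keyframe per phone midpoint."""
    return [(0.5 * (p.start + p.end), p.symbol, articulation_scale(p, effort))
            for p in phones]

# Example: the same phone sequence rendered with low vs. high effort.
phones = [Phone("AA", 0.00, 0.12, 0.9), Phone("M", 0.12, 0.20, 0.2)]
print(viseme_keyframes(phones, effort=0.1))   # reduced articulation
print(viseme_keyframes(phones, effort=0.9))   # hyper-articulation
```

A renderer would interpolate between such keyframes to drive lip and jaw blendshapes; the point of the sketch is only that a single scalar, effort, moves the whole output between reduced and hyper-articulated speech.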



      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      IVA '23: Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents
      September 2023
      376 pages
      ISBN:9781450399944
      DOI:10.1145/3570945
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. ECAs
      2. facial animation
      3. humour generation
      4. speech synthesis


      Conference

      IVA '23

      Acceptance Rates

      Overall Acceptance Rate: 53 of 196 submissions, 27%
