
Speech gesture generation from the trimodal context of text, audio, and speaker identity

Published: 27 November 2020

Abstract

For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human-agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to generate gestures reliably. By incorporating the multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match the speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed model outperforms existing end-to-end generation models. We further confirm that our model works with synthesized audio in scenarios where context is constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space, which is learned from videos of various speakers. All code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
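
The abstract compresses three technical ideas: trimodal conditioning, adversarial training, and style control through a learned speaker embedding space. As a reading aid, here is a minimal sketch of the trimodal conditioning in PyTorch. It is not the authors' architecture; every dimension, layer choice, and name below is hypothetical, and the real implementation (including the adversarial training scheme) is in the repository linked above.

```python
# Hypothetical sketch: fuse text, audio, and speaker-style context per frame,
# then decode a pose sequence. Sizes and layers are illustrative only.
import torch
import torch.nn as nn

class TrimodalGestureGenerator(nn.Module):
    def __init__(self, vocab_size=20000, word_dim=300, audio_dim=128,
                 style_dim=16, hidden_dim=256, pose_dim=27, n_speakers=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)    # text context
        self.text_proj = nn.Linear(word_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)    # audio context
        self.style_emb = nn.Embedding(n_speakers, style_dim)  # speaker-identity style space
        self.decoder = nn.GRU(2 * hidden_dim + style_dim, hidden_dim,
                              num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, pose_dim)        # joint coordinates per frame

    def forward(self, word_ids, audio_feat, speaker_id):
        # word_ids: (B, T) frame-aligned tokens; audio_feat: (B, T, audio_dim)
        text_h = self.text_proj(self.word_emb(word_ids))      # (B, T, H)
        audio_h = self.audio_proj(audio_feat)                 # (B, T, H)
        style = self.style_emb(speaker_id)                    # (B, style_dim)
        style = style.unsqueeze(1).expand(-1, text_h.size(1), -1)
        z = torch.cat([text_h, audio_h, style], dim=-1)       # trimodal context
        h, _ = self.decoder(z)
        return self.out(h)                                    # (B, T, pose_dim)
```

Calling such a generator twice with identical words and audio but different speaker IDs is the mechanism behind the style control described in the abstract: the speech is held fixed while the style embedding moves.

```python
gen = TrimodalGestureGenerator()
words = torch.randint(0, 20000, (1, 34))           # one clip, 34 frames
audio = torch.randn(1, 34, 128)
poses_a = gen(words, audio, torch.tensor([3]))     # style of speaker 3
poses_b = gen(words, audio, torch.tensor([7]))     # same speech, style of speaker 7
```

The quantitative metric the paper introduces (the Fréchet Gesture Distance) compares Gaussian statistics of latent features of human and generated motion, in the spirit of FID for images. A sketch of that comparison, assuming real_feat and gen_feat are NumPy arrays of features from a pretrained motion encoder (the encoder itself is not shown here):

```python
# Fréchet distance between Gaussians fit to two feature sets (rows = samples).
import numpy as np
from scipy import linalg

def frechet_distance(real_feat, gen_feat):
    mu1, mu2 = real_feat.mean(axis=0), gen_feat.mean(axis=0)
    sigma1 = np.cov(real_feat, rowvar=False)
    sigma2 = np.cov(gen_feat, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):   # sqrtm can leave tiny imaginary noise
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```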

Supplementary Material

MP4 File (a222-yoon.mp4)
MP4 File (3414685.3417838.mp4): presentation video




Published In

ACM Transactions on Graphics, Volume 39, Issue 6 (December 2020), 1605 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3414685
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 27 November 2020
      Published in TOG Volume 39, Issue 6


      Author Tags

      1. co-speech gesture
      2. evaluation of a generative model
      3. multimodality
      4. neural generative model
      5. nonverbal behavior

      Qualifiers

      • Research-article

      Funding Sources

      • Korea government (MSIT)


Cited By

• Audio2DiffuGesture: Generating a diverse co-speech gesture based on a diffusion model. Electronic Research Archive 32:9 (2024), 5392-5408. DOI: 10.3934/era.2024250
• Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication. Sensors 24:11 (2024), 3671. DOI: 10.3390/s24113671
• A Generative Model to Embed Human Expressivity into Robot Motions. Sensors 24:2 (2024), 569. DOI: 10.3390/s24020569
• Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences 14:4 (2024), 1460. DOI: 10.3390/app14041460
• EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 6113-6122. DOI: 10.1145/3664647.3681392
• Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7027-7035. DOI: 10.1145/3664647.3680892
• Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 6774-6783. DOI: 10.1145/3664647.3680847
• MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 3266-3274. DOI: 10.1145/3664647.3680684
• Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis. ACM Transactions on Graphics 43:4 (2024), 1-17. DOI: 10.1145/3658134
• Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43:3 (2024), 1-28. DOI: 10.1145/3656374
