
Speech gesture generation from the trimodal context of text, audio, and speaker identity

Published: 27 November 2020

Abstract

For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human-agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to generate gestures reliably. By incorporating the multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match the speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed model outperforms existing end-to-end generation models. We further confirm that our model works with synthesized audio in scenarios where context is constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space, which is learned from videos of various speakers. All code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.
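
The abstract compresses three technical ideas: trimodal conditioning, adversarial training, and style control through a learned speaker embedding space. As a reading aid, here is a minimal sketch of the trimodal conditioning in PyTorch. It is not the authors' architecture; every dimension, layer choice, and name below is hypothetical, and the real implementation (including the adversarial training scheme) is in the repository linked above.

```python
# Hypothetical sketch: fuse text, audio, and speaker-style context per frame,
# then decode a pose sequence. Sizes and layers are illustrative only.
import torch
import torch.nn as nn

class TrimodalGestureGenerator(nn.Module):
    def __init__(self, vocab_size=20000, word_dim=300, audio_dim=128,
                 style_dim=16, hidden_dim=256, pose_dim=27, n_speakers=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)    # text context
        self.text_proj = nn.Linear(word_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)    # audio context
        self.style_emb = nn.Embedding(n_speakers, style_dim)  # speaker-identity style space
        self.decoder = nn.GRU(2 * hidden_dim + style_dim, hidden_dim,
                              num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, pose_dim)        # joint coordinates per frame

    def forward(self, word_ids, audio_feat, speaker_id):
        # word_ids: (B, T) frame-aligned tokens; audio_feat: (B, T, audio_dim)
        text_h = self.text_proj(self.word_emb(word_ids))      # (B, T, H)
        audio_h = self.audio_proj(audio_feat)                 # (B, T, H)
        style = self.style_emb(speaker_id)                    # (B, style_dim)
        style = style.unsqueeze(1).expand(-1, text_h.size(1), -1)
        z = torch.cat([text_h, audio_h, style], dim=-1)       # trimodal context
        h, _ = self.decoder(z)
        return self.out(h)                                    # (B, T, pose_dim)
```

Calling such a generator twice with identical words and audio but different speaker IDs is the mechanism behind the style control described in the abstract: the speech is held fixed while the style embedding moves.

```python
gen = TrimodalGestureGenerator()
words = torch.randint(0, 20000, (1, 34))           # one clip, 34 frames
audio = torch.randn(1, 34, 128)
poses_a = gen(words, audio, torch.tensor([3]))     # style of speaker 3
poses_b = gen(words, audio, torch.tensor([7]))     # same speech, style of speaker 7
```

The quantitative metric the paper introduces (the Fréchet Gesture Distance) compares Gaussian statistics of latent features of human and generated motion, in the spirit of FID for images. A sketch of that comparison, assuming real_feat and gen_feat are NumPy arrays of features from a pretrained motion encoder (the encoder itself is not shown here):

```python
# Fréchet distance between Gaussians fit to two feature sets (rows = samples).
import numpy as np
from scipy import linalg

def frechet_distance(real_feat, gen_feat):
    mu1, mu2 = real_feat.mean(axis=0), gen_feat.mean(axis=0)
    sigma1 = np.cov(real_feat, rowvar=False)
    sigma2 = np.cov(gen_feat, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):   # sqrtm can leave tiny imaginary noise
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```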

Supplementary Material

MP4 File (a222-yoon.mp4)
MP4 File (3414685.3417838.mp4): presentation video




Published In

ACM Transactions on Graphics, Volume 39, Issue 6 (December 2020), 1605 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3414685
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 27 November 2020
      Published in TOG Volume 39, Issue 6


      Author Tags

      1. co-speech gesture
      2. evaluation of a generative model
      3. multimodality
      4. neural generative model
      5. nonverbal behavior

      Qualifiers

      • Research-article

      Funding Sources

      • Korea government (MSIT)


Cited By

• Audio2DiffuGesture: Generating a diverse co-speech gesture based on a diffusion model. Electronic Research Archive 32:9 (2024), 5392-5408. DOI: 10.3934/era.2024250
• Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication. Sensors 24:11 (2024), 3671. DOI: 10.3390/s24113671
• A Generative Model to Embed Human Expressivity into Robot Motions. Sensors 24:2 (2024), 569. DOI: 10.3390/s24020569
• Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences 14:4 (2024), 1460. DOI: 10.3390/app14041460
• EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 6113-6122. DOI: 10.1145/3664647.3681392
• Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7027-7035. DOI: 10.1145/3664647.3680892
• Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 6774-6783. DOI: 10.1145/3664647.3680847
• MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation. Proceedings of the 32nd ACM International Conference on Multimedia (2024), 3266-3274. DOI: 10.1145/3664647.3680684
• Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis. ACM Transactions on Graphics 43:4 (2024), 1-17. DOI: 10.1145/3658134
• Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43:3 (2024), 1-28. DOI: 10.1145/3656374
