Speech gesture generation from the trimodal context of text, audio, and speaker identity

Published: 27 November 2020


For human-like agents, including virtual avatars and social robots, making proper gestures while speaking is crucial in human-agent interaction. Co-speech gestures enhance interaction experiences and make the agents look alive. However, it is difficult to generate human-like gestures due to the lack of understanding of how people gesture. Data-driven approaches attempt to learn gesticulation skills from human demonstrations, but the ambiguous and individual nature of gestures hinders learning. In this paper, we present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. By incorporating a multimodal context and an adversarial training scheme, the proposed model outputs gestures that are human-like and that match the speech content and rhythm. We also introduce a new quantitative evaluation metric for gesture generation models. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model outperforms existing end-to-end generation models. We further confirm that our model works with synthesized audio in a scenario where contexts are constrained, and show that different gesture styles can be generated for the same speech by specifying different speaker identities in the style embedding space that is learned from videos of various speakers. All the code and data are available at https://github.com/ai4r/Gesture-Generation-from-Trimodal-Context.

Cited By

  • (2024) Audio2DiffuGesture: Generating a diverse co-speech gesture based on a diffusion model. Electronic Research Archive 32:9 (5392-5408). DOI: 10.3934/era.2024250. Online publication date: 2024
  • (2024) Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication. Sensors 24:11 (3671). DOI: 10.3390/s24113671. Online publication date: 5-Jun-2024
  • (2024) A Generative Model to Embed Human Expressivity into Robot Motions. Sensors 24:2 (569). DOI: 10.3390/s24020569. Online publication date: 16-Jan-2024
      ACM Transactions on Graphics  Volume 39, Issue 6
      December 2020
      1605 pages
      © 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.


      Association for Computing Machinery

      New York, NY, United States

      Published in TOG Volume 39, Issue 6


      Author Tags

      1. co-speech gesture
      2. evaluation of a generative model
      3. multimodality
      4. neural generative model
      5. nonverbal behavior


      Funding Sources

      • Korea government (MSIT)


