Bodyformer: Semantics-guided 3D Body Gesture Synthesis with Transformer

Published: 26 July 2023
    Abstract

    Automatic gesture synthesis from speech is a topic that has attracted researchers for applications in remote communication, video games, and the Metaverse. Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of the rich cross-modal dataset needed for training. In this paper, we propose a novel transformer-based framework for automatic 3D body gesture synthesis from speech. To capture the stochastic nature of body gestures during speech, we propose a variational transformer that effectively models a probabilistic distribution over gestures and can produce diverse gestures during inference. Furthermore, we introduce a mode positional embedding layer to capture the different motion speeds of different speaking modes. To cope with the scarcity of data, we design an intra-modal pre-training scheme that can learn the complex mapping between speech and 3D gestures from a limited amount of data. Our system is trained on either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset. The results show that our system produces more realistic, appropriate, and diverse body gestures than existing state-of-the-art approaches.

    Supplementary Material

    • ZIP File (papers_862-supplemental.zip): supplemental material
    • MP4 File (papers_862_VOD.mp4): presentation

        Published In

        ACM Transactions on Graphics, Volume 42, Issue 4
        August 2023, 1912 pages
        ISSN: 0730-0301
        EISSN: 1557-7368
        DOI: 10.1145/3609020
        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. motion generation
        2. transformer
        3. deep learning

        Qualifiers

        • Research-article
