DOI: 10.1145/3536221.3558060
Research article · Open access

The IVI Lab entry to the GENEA Challenge 2022 – A Tacotron2 Based Method for Co-Speech Gesture Generation With Locality-Constraint Attention Mechanism

Published: 07 November 2022

Abstract

This paper describes the IVI Lab entry to the GENEA Challenge 2022. We formulate gesture generation as a sequence-to-sequence conversion task with text, audio, and speaker identity as inputs and body motion as the output. We use the Tacotron2 architecture as our backbone, with a locality-constraint attention mechanism that guides the decoder to learn dependencies from neighboring latent features. The collective evaluation released by the GENEA Challenge 2022 indicates that our two entries (FSH for the full-body track and USK for the upper-body track) statistically outperform the audio-driven and text-driven baselines on both subjective metrics. Remarkably, our full-body entry receives the highest speech appropriateness (60.5% matched) among all submitted entries. We also conduct an objective evaluation comparing our motion acceleration and jerk with those of two autoregressive baselines. The results indicate that the motion distribution of our generated gestures is much closer to the distribution of natural gestures.
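The locality-constraint attention named in the abstract can be read as ordinary content-based attention whose scores are biased toward the neighborhood of the decoder's previous focus, so each output frame attends mostly to nearby latent features. The sketch below (PyTorch) is a minimal illustration of that idea, not the authors' implementation; the module name, tensor shapes, and the Gaussian window width sigma are assumptions made for exposition.

    # Minimal sketch of locality-constrained attention (illustrative, not the paper's code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalityConstrainedAttention(nn.Module):
        def __init__(self, query_dim, key_dim, attn_dim, sigma=3.0):
            super().__init__()
            self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
            self.key_proj = nn.Linear(key_dim, attn_dim, bias=False)
            self.score = nn.Linear(attn_dim, 1, bias=False)
            self.sigma = sigma  # assumed width of the locality window

        def forward(self, query, keys, prev_focus):
            # query: (B, query_dim) decoder state; keys: (B, T, key_dim) encoder outputs
            # prev_focus: (B,) index of the previous attention peak
            B, T, _ = keys.shape
            energies = self.score(torch.tanh(
                self.query_proj(query).unsqueeze(1) + self.key_proj(keys))).squeeze(-1)  # (B, T)
            # Gaussian locality prior: penalize positions far from the previous focus.
            positions = torch.arange(T, device=keys.device).float()
            prior = -((positions.unsqueeze(0) - prev_focus.unsqueeze(1).float()) ** 2) \
                    / (2.0 * self.sigma ** 2)
            weights = F.softmax(energies + prior, dim=-1)                  # (B, T)
            context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)     # (B, key_dim)
            return context, weights

    # Example call with made-up sizes:
    attn = LocalityConstrainedAttention(query_dim=256, key_dim=128, attn_dim=128)
    ctx, w = attn(torch.randn(2, 256), torch.randn(2, 50, 128), torch.zeros(2, dtype=torch.long))

In a Tacotron2-style decoder loop, prev_focus would be updated at each step (for example, as the argmax of the previous weights), which is what keeps the attention locally constrained rather than free to jump across the sequence.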
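The objective metrics mentioned above, acceleration and jerk, are the second and third time derivatives of joint positions and can be estimated with finite differences. Below is a minimal sketch under assumed conventions (NumPy, positions laid out as frames x joints x 3, a hypothetical fps argument); it is not the paper's evaluation code.

    # Illustrative finite-difference estimate of motion acceleration and jerk.
    import numpy as np

    def acceleration_and_jerk(positions, fps=30.0):
        """positions: (T, J, 3) array of joint positions over T frames."""
        dt = 1.0 / fps
        velocity = np.diff(positions, axis=0) / dt      # (T-1, J, 3)
        acceleration = np.diff(velocity, axis=0) / dt   # (T-2, J, 3)
        jerk = np.diff(acceleration, axis=0) / dt       # (T-3, J, 3)
        # Summarize each frame by the mean joint-wise magnitude.
        acc_mag = np.linalg.norm(acceleration, axis=-1).mean(axis=-1)
        jerk_mag = np.linalg.norm(jerk, axis=-1).mean(axis=-1)
        return acc_mag, jerk_mag

Comparing the distributions of acc_mag and jerk_mag between generated and natural motion (as histograms or summary statistics) yields the kind of distributional comparison the abstract reports.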



        Published In

        ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
        November 2022
        830 pages
        ISBN:9781450393904
        DOI:10.1145/3536221


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 07 November 2022


        Author Tags

        1. co-speech gesture generation
        2. locality constraint attention
        3. sequence-to-sequence modeling

        Qualifiers

        • Research-article
        • Research
        • Refereed limited


        Conference

        ICMI '22

        Acceptance Rates

        Overall Acceptance Rate 453 of 1,080 submissions, 42%


Bibliometrics

Article Metrics

        • Downloads (Last 12 months)219
        • Downloads (Last 6 weeks)24
        Reflects downloads up to 10 Nov 2024

Cited By

• (2024) Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences 14(4), 1460. https://doi.org/10.3390/app14041460
• (2024) Gesture Area Coverage to Assess Gesture Expressiveness and Human-Likeness. Companion Proceedings of the 26th International Conference on Multimodal Interaction, 165–169. https://doi.org/10.1145/3686215.3688822
• (2024) From Words to Worlds: Transforming One-line Prompts into Multi-modal Digital Stories with LLM Agents. Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games, 1–12. https://doi.org/10.1145/3677388.3696321
• (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1–28. https://doi.org/10.1145/3656374
• (2024) Beyond Words: Enhancing Natural Interaction by Recognizing Social Conversation Contexts in HRI. 2024 21st International Conference on Ubiquitous Robots (UR), 669–672. https://doi.org/10.1109/UR61395.2024.10597523
• (2024) Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8296–8300. https://doi.org/10.1109/ICASSP48485.2024.10448208
• (2024) MDG: Multilingual Co-speech Gesture Generation with Low-level Audio Representation and Diffusion Models. 2024 International Conference on Asian Language Processing (IALP), 210–215. https://doi.org/10.1109/IALP63756.2024.10661182
• (2024) Learning from Synthetic Human Group Activities. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21922–21932. https://doi.org/10.1109/CVPR52733.2024.02070
• (2023) DiffuseStyleGesture. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 5860–5868. https://doi.org/10.24963/ijcai.2023/650
• (2023) "Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion. Companion Publication of the 25th International Conference on Multimodal Interaction, 6–10. https://doi.org/10.1145/3610661.3617160
