DOI: 10.1145/3536221.3558060
Research article · Open access

The IVI Lab entry to the GENEA Challenge 2022 – A Tacotron2 Based Method for Co-Speech Gesture Generation With Locality-Constraint Attention Mechanism

Published: 07 November 2022

Abstract

This paper describes the IVI Lab entry to the GENEA Challenge 2022. We formulate gesture generation as a sequence-to-sequence conversion task with text, audio, and speaker identity as inputs and body motion as the output. We use the Tacotron2 architecture as our backbone, with a locality-constraint attention mechanism that guides the decoder to learn dependencies from neighboring latent features. The collective evaluation released by the GENEA Challenge 2022 indicates that our two entries (FSH for the full-body track and USK for the upper-body track) statistically outperform the audio-driven and text-driven baselines on both subjective metrics. Remarkably, our full-body entry receives the highest speech appropriateness (60.5% matched) among all submitted entries. We also conduct an objective evaluation comparing our motion acceleration and jerk with those of two autoregressive baselines. The results indicate that the motion distribution of our generated gestures is much closer to the distribution of natural gestures.
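The locality-constraint attention named in the abstract can be read as ordinary content-based attention whose scores are biased toward the neighborhood of the decoder's previous focus, so each output frame attends mostly to nearby latent features. The sketch below (PyTorch) is a minimal illustration of that idea, not the authors' implementation; the module name, tensor shapes, and the Gaussian window width sigma are assumptions made for exposition.

    # Minimal sketch of locality-constrained attention (illustrative, not the paper's code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalityConstrainedAttention(nn.Module):
        def __init__(self, query_dim, key_dim, attn_dim, sigma=3.0):
            super().__init__()
            self.query_proj = nn.Linear(query_dim, attn_dim, bias=False)
            self.key_proj = nn.Linear(key_dim, attn_dim, bias=False)
            self.score = nn.Linear(attn_dim, 1, bias=False)
            self.sigma = sigma  # assumed width of the locality window

        def forward(self, query, keys, prev_focus):
            # query: (B, query_dim) decoder state; keys: (B, T, key_dim) encoder outputs
            # prev_focus: (B,) index of the previous attention peak
            B, T, _ = keys.shape
            energies = self.score(torch.tanh(
                self.query_proj(query).unsqueeze(1) + self.key_proj(keys))).squeeze(-1)  # (B, T)
            # Gaussian locality prior: penalize positions far from the previous focus.
            positions = torch.arange(T, device=keys.device).float()
            prior = -((positions.unsqueeze(0) - prev_focus.unsqueeze(1).float()) ** 2) \
                    / (2.0 * self.sigma ** 2)
            weights = F.softmax(energies + prior, dim=-1)                  # (B, T)
            context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)     # (B, key_dim)
            return context, weights

    # Example call with made-up sizes:
    attn = LocalityConstrainedAttention(query_dim=256, key_dim=128, attn_dim=128)
    ctx, w = attn(torch.randn(2, 256), torch.randn(2, 50, 128), torch.zeros(2, dtype=torch.long))

In a Tacotron2-style decoder loop, prev_focus would be updated at each step (for example, as the argmax of the previous weights), which is what keeps the attention locally constrained rather than free to jump across the sequence.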
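The objective metrics mentioned above, acceleration and jerk, are the second and third time derivatives of joint positions and can be estimated with finite differences. Below is a minimal sketch under assumed conventions (NumPy, positions laid out as frames x joints x 3, a hypothetical fps argument); it is not the paper's evaluation code.

    # Illustrative finite-difference estimate of motion acceleration and jerk.
    import numpy as np

    def acceleration_and_jerk(positions, fps=30.0):
        """positions: (T, J, 3) array of joint positions over T frames."""
        dt = 1.0 / fps
        velocity = np.diff(positions, axis=0) / dt      # (T-1, J, 3)
        acceleration = np.diff(velocity, axis=0) / dt   # (T-2, J, 3)
        jerk = np.diff(acceleration, axis=0) / dt       # (T-3, J, 3)
        # Summarize each frame by the mean joint-wise magnitude.
        acc_mag = np.linalg.norm(acceleration, axis=-1).mean(axis=-1)
        jerk_mag = np.linalg.norm(jerk, axis=-1).mean(axis=-1)
        return acc_mag, jerk_mag

Comparing the distributions of acc_mag and jerk_mag between generated and natural motion (as histograms or summary statistics) yields the kind of distributional comparison the abstract reports.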



        Published In

        ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
        November 2022
        830 pages
        ISBN:9781450393904
        DOI:10.1145/3536221


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 07 November 2022


        Author Tags

        1. co-speech gesture generation
        2. locality constraint attention
        3. sequence-to-sequence modeling

        Qualifiers

        • Research-article
        • Research
        • Refereed limited


        Conference

        ICMI '22

        Acceptance Rates

        Overall Acceptance Rate 453 of 1,080 submissions, 42%


Bibliometrics

Article Metrics

        • Downloads (Last 12 months)219
        • Downloads (Last 6 weeks)24
        Reflects downloads up to 10 Nov 2024

Cited By

• (2024) Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences 14(4), 1460. https://doi.org/10.3390/app14041460
• (2024) Gesture Area Coverage to Assess Gesture Expressiveness and Human-Likeness. Companion Proceedings of the 26th International Conference on Multimodal Interaction, 165–169. https://doi.org/10.1145/3686215.3688822
• (2024) From Words to Worlds: Transforming One-line Prompts into Multi-modal Digital Stories with LLM Agents. Proceedings of the 17th ACM SIGGRAPH Conference on Motion, Interaction, and Games, 1–12. https://doi.org/10.1145/3677388.3696321
• (2024) Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics 43(3), 1–28. https://doi.org/10.1145/3656374
• (2024) Beyond Words: Enhancing Natural Interaction by Recognizing Social Conversation Contexts in HRI. 2024 21st International Conference on Ubiquitous Robots (UR), 669–672. https://doi.org/10.1109/UR61395.2024.10597523
• (2024) Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8296–8300. https://doi.org/10.1109/ICASSP48485.2024.10448208
• (2024) MDG: Multilingual Co-speech Gesture Generation with Low-level Audio Representation and Diffusion Models. 2024 International Conference on Asian Language Processing (IALP), 210–215. https://doi.org/10.1109/IALP63756.2024.10661182
• (2024) Learning from Synthetic Human Group Activities. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21922–21932. https://doi.org/10.1109/CVPR52733.2024.02070
• (2023) DiffuseStyleGesture. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), 5860–5868. https://doi.org/10.24963/ijcai.2023/650
• (2023) "Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion. Companion Publication of the 25th International Conference on Multimodal Interaction, 6–10. https://doi.org/10.1145/3610661.3617160
