DOI: 10.1145/3610661.3616551

The KU-ISPL entry to the GENEA Challenge 2023: A Diffusion Model for Co-speech Gesture Generation

Published: 09 October 2023

Abstract

This paper describes the diffusion model for co-speech gesture generation presented as the KU-ISPL entry to the GENEA Challenge 2023. We formulate gesture generation as two problems, co-speech gesture generation and semantic gesture generation, and focus on the former, which we solve with a denoising diffusion probabilistic model conditioned on text, audio, and pre-pose inputs. We use a U-Net with cross-attention as the denoising model, and we propose a gesture autoencoder that maps the gesture domain to a latent domain. The collective evaluation released by the GENEA Challenge 2023 shows that our model successfully generates co-speech gestures: our system receives a mean human-likeness score of 32.0, a preference-matched appropriateness score of 53.6% for the main-agent speech, and an appropriateness score of 53.5% for the interlocutor speech. We also conduct an ablation study to measure the effect of the pre-pose condition. These results indicate that our system contributes to co-speech gesture generation for natural interaction.
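As a rough illustration of the pipeline the abstract describes (a gesture autoencoder plus a conditional denoising diffusion model over the latent space), the sketch below shows one way such a system could be wired up. This is not the authors' code: the GRU-based autoencoder, the single cross-attention layer standing in for the full U-Net denoiser, and all dimensions (pose_dim, latent_dim, cond_dim) are assumptions made for the example.

```python
# Hypothetical sketch of a conditional latent DDPM for gesture generation.
# Module names, shapes, and the noise schedule are illustrative assumptions,
# not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (Ho et al., 2020)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class GestureAutoencoder(nn.Module):
    """Maps joint-rotation sequences to a compact latent sequence and back."""
    def __init__(self, pose_dim=56, latent_dim=128):
        super().__init__()
        self.enc = nn.GRU(pose_dim, latent_dim, batch_first=True)
        self.dec = nn.GRU(latent_dim, pose_dim, batch_first=True)

    def encode(self, x):                   # x: (batch, frames, pose_dim)
        z, _ = self.enc(x)
        return z                           # (batch, frames, latent_dim)

    def decode(self, z):
        x_hat, _ = self.dec(z)
        return x_hat

class CondDenoiser(nn.Module):
    """Stand-in for the U-Net: predicts the noise from z_t, the timestep,
    and the concatenated text/audio/pre-pose conditions via cross-attention."""
    def __init__(self, latent_dim=128, cond_dim=256, heads=4):
        super().__init__()
        self.t_embed = nn.Embedding(T, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, z_t, t, cond):       # cond: (batch, cond_len, cond_dim)
        h = z_t + self.t_embed(t)[:, None, :]
        h, _ = self.attn(h, cond, cond)    # cross-attend to the conditions
        return self.out(h)                 # predicted noise, same shape as z_t

def ddpm_loss(denoiser, z0, cond):
    """One DDPM training step: epsilon-prediction MSE at a random timestep."""
    B = z0.size(0)
    t = torch.randint(0, T, (B,))
    eps = torch.randn_like(z0)
    a = alpha_bars[t].view(B, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps   # forward (noising) process
    return F.mse_loss(denoiser(z_t, t, cond), eps)
```

At inference time, one would sample Gaussian noise in the latent space, run the reverse DDPM updates conditioned on the text/audio/pre-pose features, and decode the final latent sequence back to joint rotations with the autoencoder.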


Cited By

  • The GENEA Challenge 2023: A large-scale evaluation of gesture generation models in monadic and dyadic settings. In Proceedings of the 25th International Conference on Multimodal Interaction, 792–801. DOI: 10.1145/3577190.3616120. Online publication date: 9 October 2023.


      Published In

      ICMI '23 Companion: Companion Publication of the 25th International Conference on Multimodal Interaction
      October 2023
      434 pages
      ISBN: 9798400703218
      DOI: 10.1145/3610661

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. GENEA Challenge
      2. co-speech gesture generation
      3. diffusion
      4. generative models
      5. neural networks

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • This work was supported by the Development of cognitive/response advancement technology for AI avatar commercialization project, funded by the Brand Engagement Network (BEN).

      Conference

      ICMI '23

      Acceptance Rates

      Overall Acceptance Rate: 453 of 1,080 submissions, 42%

