ACT2G: Attention-based Contrastive Learning for Text-to-Gesture Generation

Published: 24 August 2023
Abstract

The recent increase in remote work, online meetings, and tele-operation tasks has made people realize that gestures for avatars and communication robots matter more than previously thought. Gesture is one of the key factors in achieving smooth and natural communication between humans and AI systems and has been intensively researched. Current gesture generation methods are mostly deep neural networks that take text, audio, and other information as input; however, they generate gestures mainly from audio, producing so-called beat gestures. Although beat gestures account for more than 70% of actual human gestures, content-based gestures sometimes play an important role in making avatars more realistic and human-like. In this paper, we propose attention-based contrastive learning for text-to-gesture generation (ACT2G), in which the generated gestures represent the content of the text by estimating an attention weight for each word of the input. Because the text and gesture features computed with these attention weights are mapped to the same latent space by contrastive learning, once text is given as input, the network outputs a feature vector that can be used to generate gestures related to its content. A user study confirmed that gestures generated by ACT2G were rated higher than those of existing methods. In addition, we demonstrate that a wide variety of gestures can be generated from the same text when creators adjust the attention weights.
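
As a rough illustration (not the authors' implementation), the sketch below shows the general recipe the abstract describes: per-word attention weights pool text embeddings into a sentence-level feature, a gesture encoder produces a matching feature from a pose sequence, and a contrastive loss maps both into the same latent space. The module names, dimensions, GRU gesture encoder, and InfoNCE-style loss are assumptions made for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    # Pools per-word embeddings into one text feature via learned attention weights.
    def __init__(self, word_dim=768, latent_dim=256):
        super().__init__()
        self.attn = nn.Linear(word_dim, 1)          # one attention score per word
        self.proj = nn.Linear(word_dim, latent_dim)

    def forward(self, word_emb, mask):
        # word_emb: (B, T, word_dim) word embeddings (e.g. from BERT); mask: (B, T), 1 = real token
        scores = self.attn(word_emb).squeeze(-1)              # (B, T)
        scores = scores.masked_fill(mask == 0, -1e9)
        weights = scores.softmax(dim=-1)                      # per-word attention weights
        pooled = (weights.unsqueeze(-1) * word_emb).sum(1)    # attention-weighted text feature
        return F.normalize(self.proj(pooled), dim=-1), weights

class GestureEncoder(nn.Module):
    # Encodes a pose sequence into a single gesture feature (hypothetical GRU encoder).
    def __init__(self, pose_dim=57, latent_dim=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim, latent_dim, batch_first=True)
        self.proj = nn.Linear(latent_dim, latent_dim)

    def forward(self, poses):
        # poses: (B, T, pose_dim) joint positions or rotations per frame
        _, h = self.gru(poses)
        return F.normalize(self.proj(h[-1]), dim=-1)

def contrastive_loss(text_z, gesture_z, temperature=0.07):
    # Symmetric InfoNCE-style loss: the matching text/gesture pair in a batch is
    # pulled together; mismatched pairs are pushed apart.
    logits = text_z @ gesture_z.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(text_z.size(0), device=text_z.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

In a setup like this, a text feature alone suffices at inference time to produce a gesture whose content matches the words the attention highlights, which is consistent with the abstract's point that creators can vary the generated gestures by changing the attention weights.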



          Published In

          Proceedings of the ACM on Computer Graphics and Interactive Techniques, Volume 6, Issue 3
          August 2023
          403 pages
          EISSN:2577-6193
          DOI:10.1145/3617582
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 24 August 2023
          Published in PACMCGIT Volume 6, Issue 3

          Author Tags

          1. contrastive learning
          2. gesture generation
          3. multimodal interaction

          Qualifiers

          • Research-article
          • Research
          • Refereed

          Funding Sources

          • JSPS/KAKENHI
