DOI: 10.1145/3267851.3267878

Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network

Published: 05 November 2018

Abstract

We present a novel framework that automatically generates natural gesture motions accompanying speech from audio utterances. Built on a bi-directional LSTM network, our model learns speech-gesture relationships with both backward and forward consistency over long time spans. At each time step, the network regresses a full 3D skeletal human pose from perceptual features extracted from the input audio; we then apply combined temporal filters to smooth the generated pose sequences. We trained the network on a speech-gesture dataset recorded with a headset and marker-based motion capture. We validated our approach with a subjective evaluation that compared the generated gestures against "original" human gestures and "mismatched" human gestures taken from a different utterance. The results show that our generated gestures are significantly better than the "mismatched" gestures with respect to time consistency, and marginally significantly better with respect to semantic consistency.
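The abstract mentions applying combined temporal filters to smooth the regressed pose sequences. The paper's exact filter combination is not given here, but one well-known adaptive low-pass filter for noisy motion streams is the 1€ filter (Casiez et al., 2012). The sketch below is a minimal pure-Python illustration of that technique applied to one joint coordinate; the parameter values and the example signal are illustrative assumptions, not the authors' implementation.

```python
import math

class OneEuroFilter:
    """1€ filter (Casiez et al. 2012): an adaptive low-pass filter.
    At low speeds it smooths aggressively to remove jitter; at high
    speeds it raises the cutoff to reduce lag."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.freq = freq              # sampling frequency in Hz
        self.min_cutoff = min_cutoff  # baseline cutoff frequency
        self.beta = beta              # speed coefficient
        self.d_cutoff = d_cutoff      # cutoff for the derivative
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor of a first-order low-pass at this cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.freq
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through
            self.x_prev = x
            return x
        # Low-pass the derivative, then adapt the cutoff to speed.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

# Smooth one jittery joint coordinate sampled at 30 fps
# (the alternating signal stands in for per-frame regression noise).
f = OneEuroFilter(freq=30.0, min_cutoff=1.0, beta=0.01)
noisy = [0.5 + 0.1 * ((-1) ** t) for t in range(60)]
smooth = [f(x) for x in noisy]
```

In a pose sequence, one such filter instance would be run per skeletal degree of freedom; `min_cutoff` trades residual jitter against lag, while `beta` controls how quickly the filter loosens during fast gestures.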




Published In

IVA '18: Proceedings of the 18th International Conference on Intelligent Virtual Agents
November 2018
381 pages
ISBN:9781450360135
DOI:10.1145/3267851
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning
  2. gesture generation
  3. long short-term memory
  4. neural networks

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IVA '18: International Conference on Intelligent Virtual Agents
November 5-8, 2018
Sydney, NSW, Australia

Acceptance Rates

IVA '18 paper acceptance rate: 17 of 82 submissions (21%)
Overall acceptance rate: 53 of 196 submissions (27%)

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)65
  • Downloads (Last 6 weeks)3
Reflects downloads up to 25 Feb 2025

Other Metrics

Citations

Cited By

  • (2025) "Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation." Human-Friendly Robotics 2024, 30-44. DOI: 10.1007/978-3-031-81688-8_3. Online: 26-Feb-2025.
  • (2024) "Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication." Sensors 24(11), 3671. DOI: 10.3390/s24113671. Online: 5-Jun-2024.
  • (2024) "Selecting Iconic Gesture Forms Based on Typical Entity Images." Journal of Information Processing 32, 196-205. DOI: 10.2197/ipsjjip.32.196. Online: 2024.
  • (2024) "Exploring the Impact of Non-Verbal Virtual Agent Behavior on User Engagement in Argumentative Dialogues." Proceedings of the 12th International Conference on Human-Agent Interaction, 224-232. DOI: 10.1145/3687272.3688315. Online: 24-Nov-2024.
  • (2024) "SIGGesture: Generalized Co-Speech Gesture Synthesis via Semantic Injection with Large-Scale Pre-Training Diffusion Models." SIGGRAPH Asia 2024 Conference Papers, 1-11. DOI: 10.1145/3680528.3687677. Online: 3-Dec-2024.
  • (2024) "Diffusion Models for Virtual Agent Facial Expression Generation in Motivational Interviewing." Proceedings of the 2024 International Conference on Advanced Visual Interfaces, 1-5. DOI: 10.1145/3656650.3656673. Online: 3-Jun-2024.
  • (2024) "OUTCOME Virtual: A Tool for Automatically Creating Virtual Character Animations from Human Videos." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-3. DOI: 10.1145/3652988.3696196. Online: 16-Sep-2024.
  • (2024) "Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-4. DOI: 10.1145/3652988.3673936. Online: 16-Sep-2024.
  • (2024) "Modifying Gesture Style with Impression Words." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-9. DOI: 10.1145/3652988.3673931. Online: 16-Sep-2024.
  • (2024) "GeSTICS: A Multimodal Corpus for Studying Gesture Synthesis in Two-party Interactions with Contextualized Speech." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-10. DOI: 10.1145/3652988.3673917. Online: 16-Sep-2024.
