DOI: 10.1145/3267851.3267878

Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network

Published: 05 November 2018

Abstract

We present a novel framework that automatically generates natural gesture motions accompanying speech from audio utterances. Built on a bi-directional LSTM network, our model learns speech-gesture relationships with both backward and forward consistency over long time spans. At each time step, the network regresses a full 3D skeletal human pose from perceptual features extracted from the input audio; we then apply combined temporal filters to smooth the generated pose sequences. We trained the network on a speech-gesture dataset recorded with a headset and marker-based motion capture. We validated our approach with a subjective evaluation that compared the generated gestures against "original" human gestures and "mismatched" human gestures taken from a different utterance. The results show that our generated gestures are significantly better than the "mismatched" gestures with respect to time consistency, and marginally significantly better with respect to semantic consistency.
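The abstract mentions applying combined temporal filters to smooth the regressed pose sequences. The paper's exact filter combination is not given here, but one well-known adaptive low-pass filter for noisy motion streams is the 1€ filter (Casiez et al., 2012). The sketch below is a minimal pure-Python illustration of that technique applied to one joint coordinate; the parameter values and the example signal are illustrative assumptions, not the authors' implementation.

```python
import math

class OneEuroFilter:
    """1€ filter (Casiez et al. 2012): an adaptive low-pass filter.
    At low speeds it smooths aggressively to remove jitter; at high
    speeds it raises the cutoff to reduce lag."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.freq = freq              # sampling frequency in Hz
        self.min_cutoff = min_cutoff  # baseline cutoff frequency
        self.beta = beta              # speed coefficient
        self.d_cutoff = d_cutoff      # cutoff for the derivative
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor of a first-order low-pass at this cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.freq
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through
            self.x_prev = x
            return x
        # Low-pass the derivative, then adapt the cutoff to speed.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

# Smooth one jittery joint coordinate sampled at 30 fps
# (the alternating signal stands in for per-frame regression noise).
f = OneEuroFilter(freq=30.0, min_cutoff=1.0, beta=0.01)
noisy = [0.5 + 0.1 * ((-1) ** t) for t in range(60)]
smooth = [f(x) for x in noisy]
```

In a pose sequence, one such filter instance would be run per skeletal degree of freedom; `min_cutoff` trades residual jitter against lag, while `beta` controls how quickly the filter loosens during fast gestures.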




Published In

IVA '18: Proceedings of the 18th International Conference on Intelligent Virtual Agents
November 2018
381 pages
ISBN:9781450360135
DOI:10.1145/3267851
Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep learning
  2. gesture generation
  3. long short-term memory
  4. neural networks

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IVA '18: International Conference on Intelligent Virtual Agents
November 5-8, 2018
Sydney, NSW, Australia

Acceptance Rates

IVA '18 paper acceptance rate: 17 of 82 submissions (21%)
Overall acceptance rate: 53 of 196 submissions (27%)

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)65
  • Downloads (Last 6 weeks)3
Reflects downloads up to 25 Feb 2025

Other Metrics

Citations

Cited By

  • (2025) "Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation." Human-Friendly Robotics 2024, 30-44. DOI: 10.1007/978-3-031-81688-8_3. Online: 26-Feb-2025.
  • (2024) "Creating Expressive Social Robots That Convey Symbolic and Spontaneous Communication." Sensors 24(11), 3671. DOI: 10.3390/s24113671. Online: 5-Jun-2024.
  • (2024) "Selecting Iconic Gesture Forms Based on Typical Entity Images." Journal of Information Processing 32, 196-205. DOI: 10.2197/ipsjjip.32.196. Online: 2024.
  • (2024) "Exploring the Impact of Non-Verbal Virtual Agent Behavior on User Engagement in Argumentative Dialogues." Proceedings of the 12th International Conference on Human-Agent Interaction, 224-232. DOI: 10.1145/3687272.3688315. Online: 24-Nov-2024.
  • (2024) "SIGGesture: Generalized Co-Speech Gesture Synthesis via Semantic Injection with Large-Scale Pre-Training Diffusion Models." SIGGRAPH Asia 2024 Conference Papers, 1-11. DOI: 10.1145/3680528.3687677. Online: 3-Dec-2024.
  • (2024) "Diffusion Models for Virtual Agent Facial Expression Generation in Motivational Interviewing." Proceedings of the 2024 International Conference on Advanced Visual Interfaces, 1-5. DOI: 10.1145/3656650.3656673. Online: 3-Jun-2024.
  • (2024) "OUTCOME Virtual: A Tool for Automatically Creating Virtual Character Animations from Human Videos." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-3. DOI: 10.1145/3652988.3696196. Online: 16-Sep-2024.
  • (2024) "Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-4. DOI: 10.1145/3652988.3673936. Online: 16-Sep-2024.
  • (2024) "Modifying Gesture Style with Impression Words." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-9. DOI: 10.1145/3652988.3673931. Online: 16-Sep-2024.
  • (2024) "GeSTICS: A Multimodal Corpus for Studying Gesture Synthesis in Two-party Interactions with Contextualized Speech." Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, 1-10. DOI: 10.1145/3652988.3673917. Online: 16-Sep-2024.
