
Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Published: 04 May 2022
    Abstract

    Speech-driven 3D facial animation with accurate lip synchronization has been widely studied. However, synthesizing realistic motion for the entire face during speech has rarely been explored. In this work, we present a joint audio-text model that captures contextual information for expressive speech-driven 3D facial animation. Existing datasets are collected to cover as many phonemes as possible rather than diverse sentences, which limits the ability of audio-based models to learn varied contexts. To address this, we propose to leverage contextual text embeddings extracted from a powerful pre-trained language model that has learned rich contextual representations from large-scale text data. Our hypothesis is that these text features can disambiguate variations in upper-face expressions, which are not strongly correlated with the audio. In contrast to prior approaches that learn phoneme-level features from the text, we investigate high-level contextual text features for speech-driven 3D facial animation. We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization. We conduct quantitative and qualitative evaluations as well as a perceptual user study. The results demonstrate that our model outperforms existing state-of-the-art approaches.
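    To make the idea concrete, the sketch below shows one minimal way to fuse contextual text embeddings with per-frame audio features before regressing facial motion. It is an illustration only, not the authors' architecture: BERT is used merely as an example of a pre-trained language model, and the MFCC features, nearest-neighbor token-to-frame alignment, GRU decoder, and vertex count are all assumptions made for the sake of a self-contained example.

import numpy as np
import torch
import torch.nn as nn
import librosa
from transformers import BertTokenizer, BertModel

# --- Audio features: MFCCs over a synthetic one-second clip (stand-in for real speech) ---
sr = 16000
audio = np.random.randn(sr).astype(np.float32)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=26)        # (26, T) frame-level features
audio_feat = torch.from_numpy(mfcc.T).float().unsqueeze(0)    # (1, T, 26)
T = audio_feat.shape[1]

# --- Contextual text embeddings from a pre-trained language model (BERT as an example) ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
tokens = tokenizer("I am so happy to see you again", return_tensors="pt")
with torch.no_grad():
    text_feat = bert(**tokens).last_hidden_state              # (1, n_tokens, 768)

# --- Naive alignment: stretch token embeddings to the audio frame rate.
#     A forced aligner would provide real word/frame timings; this is a placeholder. ---
idx = torch.linspace(0, text_feat.shape[1] - 1, T).long()
text_aligned = text_feat[:, idx, :]                           # (1, T, 768)

# --- Joint decoder: concatenate the two modalities, regress per-frame vertex offsets ---
class JointAudioTextDecoder(nn.Module):
    def __init__(self, audio_dim=26, text_dim=768, hidden=256, n_vertices=5023):
        # n_vertices=5023 is just an illustrative mesh resolution
        super().__init__()
        self.gru = nn.GRU(audio_dim + text_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_vertices * 3)          # a 3D offset per vertex

    def forward(self, audio_feat, text_feat):
        fused = torch.cat([audio_feat, text_feat], dim=-1)    # (B, T, audio_dim + text_dim)
        hidden_states, _ = self.gru(fused)
        return self.out(hidden_states)                        # (B, T, n_vertices * 3)

decoder = JointAudioTextDecoder()
vertex_offsets = decoder(audio_feat, text_aligned)
print(vertex_offsets.shape)                                   # torch.Size([1, T, 15069])

    In this sketch the token embeddings are simply stretched to the audio frame rate; in a real pipeline a forced aligner such as Gentle would supply word-level timings, and the predicted per-frame offsets would be supervised against captured 3D facial motion.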

    Supplementary Material

    fan (fan.zip)
    Supplemental movie, appendix, image, and software files for Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation


    Cited By

    • (2024) S3: Speech, Script and Scene driven Head and Eye Animation. ACM Transactions on Graphics 43(4), 1-12. https://doi.org/10.1145/3658172. Online publication date: 19-Jul-2024.
    • (2024) Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance. ACM SIGGRAPH 2024 Conference Papers, 1-13. https://doi.org/10.1145/3641519.3657413. Online publication date: 13-Jul-2024.
    • (2023) FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion. Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games, 1-11. https://doi.org/10.1145/3623264.3624447. Online publication date: 15-Nov-2023.


      Information

      Published In

      Proceedings of the ACM on Computer Graphics and Interactive Techniques, Volume 5, Issue 1
      May 2022
      252 pages
      EISSN: 2577-6193
      DOI: 10.1145/3535313
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 04 May 2022
      Published in PACMCGIT Volume 5, Issue 1


      Author Tags

      1. contextual text embeddings
      2. expressive speech-driven 3D facial animation
      3. joint audio-text model
      4. pre-trained language model

      Qualifiers

      • Research-article
      • Research
      • Refereed


      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 140
      • Downloads (Last 6 weeks): 10
      Reflects downloads up to 11 Aug 2024

