Research article · Open access
DOI: 10.1145/3623264.3624451
Audiovisual Inputs for Learning Robust, Real-time Facial Animation with Lip Sync

Published: 15 November 2023

Abstract

We present a deep-learning approach for generating facial animation in real time on low-end devices by combining video and audio input data. Our method produces control signals from the audio and video inputs separately, then mixes them to animate a character rig. The architecture relies on two specialized networks that are trained on a combination of synthetic and real-world data and are heavily engineered for efficiency, so that quality avatar faces are supported even on low-end devices. In addition, the system supports several levels of detail that degrade gracefully for additional scaling and efficiency. We show how user testing was employed to improve performance, and we present a comparison with the state of the art.
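The mixing of separately produced audio and video control signals described above can be sketched as follows. This is a minimal illustration only: the function name, the use of a scalar video-confidence weight, and the linear cross-fade are assumptions for the sketch, not the paper's actual mixing rule or network outputs.

```python
def blend_controls(video_ctrl, audio_ctrl, video_conf):
    """Blend two control-signal dicts for a face rig (hypothetical sketch).

    video_ctrl / audio_ctrl: dicts mapping rig-control names to [0, 1]
    activations, as might be emitted by separate video- and audio-driven
    networks. video_conf in [0, 1] weights the video stream; when the
    camera signal is poor, the audio-driven lip-sync controls dominate.
    """
    blended = {}
    for name in set(video_ctrl) | set(audio_ctrl):
        v = video_ctrl.get(name, 0.0)
        a = audio_ctrl.get(name, 0.0)
        # Linear cross-fade; a real system might use per-control weights
        # (e.g. audio dominating only the mouth-region controls).
        blended[name] = video_conf * v + (1.0 - video_conf) * a
    return blended


controls = blend_controls(
    video_ctrl={"jawOpen": 0.2, "browInnerUp": 0.6},
    audio_ctrl={"jawOpen": 0.8},
    video_conf=0.25,
)
print(controls["jawOpen"])      # ≈ 0.65: audio dominates at low video confidence
print(controls["browInnerUp"])  # ≈ 0.15: no audio signal for the brow
```

Controls missing from one modality default to zero, so audio-only controls (lip sync) still pass through when the video stream is absent.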

Supplementary Material

MP4 File (audiovisual_face.mp4)
Supplemental Video


      Published In
      MIG '23: Proceedings of the 16th ACM SIGGRAPH Conference on Motion, Interaction and Games
      November 2023
      224 pages
      ISBN:9798400703935
      DOI:10.1145/3623264
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Facial tracking
      2. neural networks
      3. real-time character animation

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      MIG '23

