DOI: 10.1145/3581783.3611775
Research Article · Open Access

Speech-Driven 3D Face Animation with Composite and Regional Facial Movements

Published: 27 October 2023

Abstract

Speech-driven 3D face animation poses significant challenges due to the intricacy and variability inherent in human facial movements. This paper emphasizes the importance of considering both the composite and regional natures of facial movements in speech-driven 3D face animation. The composite nature pertains to how speech-independent factors globally modulate speech-driven facial movements along the temporal dimension. Meanwhile, the regional nature alludes to the notion that facial movements are not globally correlated but are actuated by local musculature along the spatial dimension. It is thus indispensable to incorporate both natures for engendering vivid animation. To address the composite nature, we introduce an adaptive modulation module that employs arbitrary facial movements to dynamically adjust speech-driven facial movements across frames on a global scale. To accommodate the regional nature, our approach ensures that each constituent of the facial features for every frame focuses on the local spatial movements of 3D faces. Moreover, we present a non-autoregressive backbone for translating audio to 3D facial movements, which maintains high-frequency nuances of facial movements and facilitates efficient inference. Comprehensive experiments and user studies demonstrate that our method surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively.
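The adaptive modulation idea described above can be pictured with a short sketch. The following PyTorch snippet is an illustrative assumption, not the authors' released implementation: the module name AdaptiveTemporalModulation, the (B, T, C) feature layout, and the pooled style code are all hypothetical. It shows the general AdaIN-style mechanism of a speech-independent style code predicting one per-channel scale and shift that is broadcast over every frame of the speech-driven features.

import torch
import torch.nn as nn

class AdaptiveTemporalModulation(nn.Module):
    """Illustrative AdaIN-style modulation of speech-driven features.

    A style code (e.g. pooled from arbitrary, speech-independent facial
    movements) predicts a per-channel scale and shift that is applied to
    the normalized speech-driven features of every frame.
    """

    def __init__(self, feat_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(style_dim, 2 * feat_dim)

    def forward(self, speech_feats: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # speech_feats: (B, T, C) frame-wise speech-driven features
        # style:        (B, style_dim) speech-independent style code
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)   # (B, C) each
        normalized = self.norm(speech_feats)                         # per-frame normalization
        # One global (per-sequence) modulation, broadcast across all frames.
        return normalized * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Example: modulate 100 frames of 256-dim features with a 128-dim style code.
mod = AdaptiveTemporalModulation(feat_dim=256, style_dim=128)
out = mod(torch.randn(2, 100, 256), torch.randn(2, 128))   # -> shape (2, 100, 256)

Broadcasting a single scale and shift over the temporal dimension mirrors the abstract's description of modulating speech-driven movements "across frames on a global scale"; the regional, locally focused facial features and the non-autoregressive audio-to-motion backbone are separate components not shown here.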


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023

Author Tags

  1. composite
  2. non-autoregressive
  3. regional
  4. speech-driven 3D face animation

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • State Key Program of the National Natural Science Foundation of China
  • Natural Science Foundation of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 516
  • Downloads (last 6 weeks): 64
Reflects downloads up to 01 Jan 2025

Cited By

  • (2024) PersonaTalk: Bring Attention to Your Persona in Visual Dubbing. SIGGRAPH Asia 2024 Conference Papers, 1-9. DOI: 10.1145/3680528.3687618. Online publication date: 3-Dec-2024.
  • (2024) MMHead: Towards Fine-grained Multi-modal 3D Facial Animation. Proceedings of the 32nd ACM International Conference on Multimedia, 7966-7975. DOI: 10.1145/3664647.3681366. Online publication date: 28-Oct-2024.
  • (2024) Expressive Talking Avatars. IEEE Transactions on Visualization and Computer Graphics, 30(5), 2538-2548. DOI: 10.1109/TVCG.2024.3372047. Online publication date: 4-Mar-2024.
  • (2024) Analyzing Visible Articulatory Movements in Speech Production For Speech-Driven 3D Facial Animation. 2024 IEEE International Conference on Image Processing (ICIP), 3575-3579. DOI: 10.1109/ICIP51287.2024.10647359. Online publication date: 27-Oct-2024.
  • (2024) 3D facial modeling, animation, and rendering for digital humans: A survey. Neurocomputing, 598, 128168. DOI: 10.1016/j.neucom.2024.128168. Online publication date: Sep-2024.
  • (2024) MambaTalk: Speech-Driven 3D Facial Animation with Mamba. MultiMedia Modeling, 310-323. DOI: 10.1007/978-981-96-2061-6_23. Online publication date: 31-Dec-2024.
  • (2024) KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding. Computer Vision – ECCV 2024, 236-253. DOI: 10.1007/978-3-031-72992-8_14. Online publication date: 30-Oct-2024.
  • (2024) UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model. Computer Vision – ECCV 2024, 204-221. DOI: 10.1007/978-3-031-72940-9_12. Online publication date: 17-Nov-2024.
