
MakeItTalk: speaker-aware talking-head animation

Published: 27 November 2020

Abstract

    We present a method that generates expressive talking-head videos from a single facial image with audio as the only input. In contrast to previous attempts to learn direct mappings from audio to raw pixels for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting the speaker-aware dynamics. Based on this intermediate representation, our method works with many portrait images in a single unified framework, including artistic paintings, sketches, 2D cartoon characters, Japanese mangas, and stylized caricatures. In addition, our method generalizes well for faces and characters that were not observed during training. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking-heads of significantly higher quality compared to prior state-of-the-art methods.
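
    To make the described pipeline concrete, here is a minimal, illustrative sketch of the data flow the abstract outlines: input audio is encoded into separate content and speaker representations, which together drive per-frame displacements of facial landmarks; a separate image-synthesis stage (not shown) then warps the input portrait to follow those landmarks. All function names, dimensions, and the random linear "encoders" below are hypothetical stand-ins for the paper's learned networks, not the authors' implementation.

    ```python
    # A minimal sketch of the two-branch pipeline described in the abstract.
    # Everything here is a hypothetical stand-in illustrating only the data flow:
    # audio -> (content, speaker) embeddings -> landmark motion -> renderer.
    import numpy as np

    rng = np.random.default_rng(0)

    T, N_MEL = 100, 80           # frames, mel bins (assumed audio features)
    D_CONTENT, D_SPEAKER = 64, 32
    N_LANDMARKS = 68             # standard 68-point facial landmark convention

    def content_encoder(mel: np.ndarray) -> np.ndarray:
        """Stand-in for the speech-content branch (drives lips and nearby regions)."""
        W = rng.standard_normal((N_MEL, D_CONTENT)) * 0.01
        return mel @ W                                   # (T, D_CONTENT)

    def speaker_encoder(mel: np.ndarray) -> np.ndarray:
        """Stand-in for the speaker-identity branch (drives expression and head dynamics)."""
        W = rng.standard_normal((N_MEL, D_SPEAKER)) * 0.01
        return (mel @ W).mean(axis=0)                    # one (D_SPEAKER,) vector per clip

    def landmark_predictor(content, speaker, base_landmarks):
        """Predict per-frame landmark positions from both embeddings."""
        feats = np.concatenate(
            [content, np.tile(speaker, (content.shape[0], 1))], axis=1)
        W = rng.standard_normal((feats.shape[1], N_LANDMARKS * 2)) * 0.01
        deltas = (feats @ W).reshape(-1, N_LANDMARKS, 2)
        return base_landmarks[None] + deltas             # (T, 68, 2)

    # Toy inputs: a mel spectrogram and the 68 landmarks of the input portrait.
    mel = rng.standard_normal((T, N_MEL))
    base_landmarks = rng.standard_normal((N_LANDMARKS, 2))

    frames = landmark_predictor(
        content_encoder(mel), speaker_encoder(mel), base_landmarks)
    print(frames.shape)  # (100, 68, 2): one landmark set per audio frame,
                         # to be consumed by an image-warping/rendering stage
    ```

    Routing everything through landmarks is what lets a single framework animate photographs, paintings, sketches, and cartoons alike: only the final rendering stage needs to care about the image style.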

    Supplementary Material

    MP4 File (a221-zhou.mp4)
MP4 File (3414685.3417774.mp4): Presentation video





        Published In

ACM Transactions on Graphics, Volume 39, Issue 6
December 2020
1605 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3414685
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 27 November 2020
        Published in TOG Volume 39, Issue 6


        Author Tags

        1. facial animation
        2. neural networks

        Qualifiers

        • Research-article

        Funding Sources

        • NSF
        • Adobe

        Article Metrics

• Downloads (Last 12 months): 754
• Downloads (Last 6 weeks): 31
        Reflects downloads up to 10 Aug 2024

Cited By

• (2024) Facial Animation Strategies for Improved Emotional Expression in Virtual Reality. Electronics 13(13), 2601. https://doi.org/10.3390/electronics13132601. Online publication date: 2-Jul-2024.
• (2024) Multi-Modal Driven Pose-Controllable Talking Head Generation. ACM Transactions on Multimedia Computing, Communications, and Applications. https://doi.org/10.1145/3673901. Online publication date: 10-Aug-2024.
• (2024) A Hierarchical 3D Gaussian Representation for Real-Time Rendering of Very Large Datasets. ACM Transactions on Graphics 43(4), 1-15. https://doi.org/10.1145/3658160. Online publication date: 19-Jul-2024.
• (2024) Toward Photo-Realistic Facial Animation Generation Based on Keypoint Features. Proceedings of the 2024 16th International Conference on Machine Learning and Computing, 334-339. https://doi.org/10.1145/3651671.3651731. Online publication date: 2-Feb-2024.
• (2024) Let the Beat Follow You - Creating Interactive Drum Sounds From Body Rhythm. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 7162-7172. https://doi.org/10.1109/WACV57701.2024.00702. Online publication date: 3-Jan-2024.
• (2024) DR²: Disentangled Recurrent Representation Learning for Data-efficient Speech Video Synthesis. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6192-6202. https://doi.org/10.1109/WACV57701.2024.00609. Online publication date: 3-Jan-2024.
• (2024) Controlling Character Motions without Observable Driving Source. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6182-6191. https://doi.org/10.1109/WACV57701.2024.00608. Online publication date: 3-Jan-2024.
• (2024) Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5280-5290. https://doi.org/10.1109/WACV57701.2024.00521. Online publication date: 3-Jan-2024.
• (2024) Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5089-5098. https://doi.org/10.1109/WACV57701.2024.00502. Online publication date: 3-Jan-2024.
• (2024) Emotional Talking Face Generation with a Single Image. 2024 21st International Conference on Ubiquitous Robots (UR), 655-662. https://doi.org/10.1109/UR61395.2024.10597491. Online publication date: 24-Jun-2024.
