Research Article
DOI: https://doi.org/10.1145/3503047.3503054

Few-shot Adversarial Audio Driving Talking Face Generation

Published: 19 January 2022

Abstract

Talking-face generation is an interesting and challenging problem in computer vision and has become a research focus. This work aims to generate realistic talking-face video sequences, with particular attention to lip synchronization and head motion. To create a personalized talking-face model, existing methods require training on large-scale audio-visual datasets. In many practical scenarios, however, the personalized appearance features and the audio-video synchronization relationship must be learned from only a few lip-synchronized sequences. In this paper, we therefore treat the task as a few-shot image synthesis problem: given an audio track and only a few lip-synchronized video sequences of the target speaker, synthesize the corresponding talking face. We apply the Reptile meta-learning algorithm to train a meta adversarial network; the resulting meta-model can be adapted quickly, from just a few reference sequences, into a personalized model. Meta-learning over the dataset yields a good set of initialization parameters, and with a few adaptation steps on the reference sequences the model learns quickly and generates highly realistic images with richer facial texture and better lip sync. Experiments on several datasets demonstrate that our method significantly outperforms state-of-the-art methods in both quantitative and qualitative comparisons.
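To make the training procedure concrete, below is a minimal sketch of one Reptile meta-iteration (Nichol et al., 2018) in PyTorch. It is illustrative only: the `model`, the `task_batches` of (audio, frame) pairs, and the simple L1 reconstruction loss are hypothetical stand-ins for the paper's generator and its full adversarial and lip-sync losses.

```python
import copy
import torch
import torch.nn.functional as F

def reptile_step(model, task_batches, inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    """One Reptile meta-iteration: adapt a clone of the model on one
    speaker's reference sequences, then move the meta-initialization
    toward the adapted weights."""
    fast_model = copy.deepcopy(model)  # inner-loop copy; meta-weights stay fixed
    opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)

    for audio, frame in task_batches[:inner_steps]:
        pred = fast_model(audio)       # generated face frame for this audio window
        loss = F.l1_loss(pred, frame)  # stand-in for the full adversarial + sync losses
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Reptile outer update: theta <- theta + meta_lr * (theta_adapted - theta)
    with torch.no_grad():
        for p, fast_p in zip(model.parameters(), fast_model.parameters()):
            p.add_(meta_lr * (fast_p - p))
```

At test time the same inner loop, without the outer update, adapts the learned initialization to a new speaker from a handful of reference sequences, which is the few-shot personalization step the abstract describes.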



        Published In

        AISS '21: Proceedings of the 3rd International Conference on Advanced Information Science and System
        November 2021, 526 pages
        ISBN: 9781450385862
        DOI: 10.1145/3503047

        Publisher

        Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. Meta learning
        2. Neural networks
        3. Talking face generation
        4. Video generation


        Conference

        AISS 2021

        Acceptance Rates

        Overall Acceptance Rate 41 of 95 submissions, 43%
