Research Article
DOI: https://doi.org/10.1145/3503047.3503054

Few-shot Adversarial Audio Driving Talking Face Generation

Published: 19 January 2022

Abstract

Talking-face generation is an interesting and challenging problem in computer vision and has become a research focus. This work aims to generate realistic talking-face video sequences, with particular attention to lip synchronization and head motion. To create a personalized talking-face model, existing methods require training on large-scale audio-visual datasets. In many practical scenarios, however, the personalized appearance features and the audio-video synchronization relationship must be learned from only a few lip-synchronized sequences. In this paper, we therefore treat the task as a few-shot image synthesis problem: given an audio track and only a few lip-synchronized video sequences of the target speaker, synthesize the corresponding talking face. We apply the Reptile meta-learning algorithm to train a meta adversarial network; the resulting meta-model can be adapted quickly, from just a few reference sequences, into a personalized model. Meta-learning over the dataset yields a good set of initialization parameters, and with a few adaptation steps on the reference sequences the model learns quickly and generates highly realistic images with richer facial texture and better lip sync. Experiments on several datasets demonstrate that our method significantly outperforms state-of-the-art methods in both quantitative and qualitative comparisons.
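To make the training procedure concrete, below is a minimal sketch of one Reptile meta-iteration (Nichol et al., 2018) in PyTorch. It is illustrative only: the `model`, the `task_batches` of (audio, frame) pairs, and the simple L1 reconstruction loss are hypothetical stand-ins for the paper's generator and its full adversarial and lip-sync losses.

```python
import copy
import torch
import torch.nn.functional as F

def reptile_step(model, task_batches, inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
    """One Reptile meta-iteration: adapt a clone of the model on one
    speaker's reference sequences, then move the meta-initialization
    toward the adapted weights."""
    fast_model = copy.deepcopy(model)  # inner-loop copy; meta-weights stay fixed
    opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)

    for audio, frame in task_batches[:inner_steps]:
        pred = fast_model(audio)       # generated face frame for this audio window
        loss = F.l1_loss(pred, frame)  # stand-in for the full adversarial + sync losses
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Reptile outer update: theta <- theta + meta_lr * (theta_adapted - theta)
    with torch.no_grad():
        for p, fast_p in zip(model.parameters(), fast_model.parameters()):
            p.add_(meta_lr * (fast_p - p))
```

At test time the same inner loop, without the outer update, adapts the learned initialization to a new speaker from a handful of reference sequences, which is the few-shot personalization step the abstract describes.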



        Published In

        AISS '21: Proceedings of the 3rd International Conference on Advanced Information Science and System
        November 2021, 526 pages
        ISBN: 9781450385862
        DOI: 10.1145/3503047

        Publisher

        Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. Meta learning
        2. Neural networks
        3. Talking face generation
        4. Video generation


        Conference

        AISS 2021

        Acceptance Rates

        Overall Acceptance Rate 41 of 95 submissions, 43%
