DOI: 10.1145/3664647.3681644

Control-Talker: A Rapid-Customization Talking Head Generation Method for Multi-Condition Control and High-Texture Enhancement

Published: 28 October 2024

Abstract

In recent years, the field of talking head generation has made significant strides. However, the substantial computational resources required for model training, coupled with a scarcity of high-quality video data, make it difficult to rapidly customize a model to a specific individual. Additionally, existing models usually support only single-modal control and cannot generate vivid facial expressions and controllable head poses from multiple conditions such as audio and video. These limitations restrict the models' widespread application. In this paper, we introduce a two-stage method called Control-Talker that achieves rapid identity customization of a talking head model and high-quality generation under multimodal conditions. Specifically, we divide the training process into two stages: a prior learning stage and an identity rapid-customization stage. 1) In the prior learning stage, we leverage a diffusion-based model pre-trained on a high-quality image dataset to acquire a robust, controllable facial prior. Meanwhile, we propose a novel high-frequency ControlNet structure to enhance the fidelity of the synthesized results. This structure extracts a high-frequency feature map from the source image, which serves as a facial texture prior and thereby preserves the facial texture of the source image. 2) In the identity rapid-customization stage, the identity is fixed by fine-tuning the U-Net part of the diffusion model on only a few images of a specific individual. The entire fine-tuning process for identity customization can be completed within approximately ten minutes, significantly reducing training costs. Furthermore, we propose a unified driving method for both audio and video, enabling the model to precisely control expressions, poses, and lighting under multiple conditions. Extensive experiments and visual results demonstrate that our method outperforms other state-of-the-art models while requiring lower training costs and less data.
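The page carries no code, but the abstract's most concrete technical detail, conditioning the generator on a high-frequency feature map extracted from the source image, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' implementation: the Gaussian-blur residual, the kernel size, and the file name source_face.png are hypothetical choices standing in for whatever extraction the paper actually uses.

```python
# Minimal sketch (not the authors' code): extracting a high-frequency
# "texture prior" map from a source portrait, of the kind the abstract
# describes feeding into a ControlNet-style conditioning branch.
# The blur-residual approach and the kernel size are assumptions.
import torch
import torchvision.transforms.functional as TF
from torchvision.io import read_image


def high_frequency_map(image: torch.Tensor, kernel_size: int = 21) -> torch.Tensor:
    """Return the high-frequency residual of a CxHxW image tensor in [0, 1].

    The low-frequency component is estimated with a Gaussian blur and
    subtracted from the original, leaving edges and fine texture.
    """
    low = TF.gaussian_blur(image, kernel_size=kernel_size)
    high = image - low  # residual keeps fine texture detail
    # Rescale to [0, 1] so it can be consumed like any other condition map.
    high = (high - high.min()) / (high.max() - high.min() + 1e-8)
    return high


if __name__ == "__main__":
    src = read_image("source_face.png").float() / 255.0  # hypothetical input path
    texture_prior = high_frequency_map(src)
    # `texture_prior` would then be passed to the conditioning branch
    # (a ControlNet-like encoder) alongside pose/expression conditions.
    print(texture_prior.shape)
```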

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 October 2024

    Author Tags

    1. high-frequency controlnet
    2. multi-condition control
    3. rapid-customization diffusion
    4. talking head generation

    Qualifiers

    • Research-article

    Funding Sources

• National Natural Science Foundation of China

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

    Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

