
Talking-Head Generation with Rhythmic Head Motion

Published: 23 August 2020

Abstract

When people deliver a speech, they naturally move their heads, and this rhythmic head motion conveys prosodic information. However, generating a lip-synced video in which the head also moves naturally is challenging. While remarkably successful, existing works either generate still talking-face videos or rely on landmark/video frames as sparse/dense mapping guidance to generate head movements, which leads to unrealistic or uncontrollable video synthesis. To overcome these limitations, we propose a 3D-aware generative network along with a hybrid embedding module and a non-linear composition module. By modeling head motion and facial expressions explicitly (in our setting, facial expression means facial movement, e.g., blinks and lip and chin movements), manipulating 3D animation carefully, and embedding reference images dynamically, our approach achieves controllable, photo-realistic, and temporally coherent talking-head videos with natural head movements. Extensive experiments on several standard benchmarks demonstrate that our method achieves significantly better results than state-of-the-art methods in both quantitative and qualitative comparisons. The code is available at https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion.
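To make the data flow the abstract describes more concrete, below is a minimal, hypothetical PyTorch sketch: reference frames are embedded dynamically into an identity code, audio drives explicit per-frame motion/expression codes, and a composition step fuses the two into output frames. All module internals, names, and shapes here are illustrative assumptions, not the authors' implementation (the 3D-aware rendering and the actual non-linear composition module are replaced by trivial stand-ins; see the linked repository for the real method).

    # Hypothetical sketch of the high-level pipeline; not the authors' code.
    import torch
    import torch.nn as nn

    class HybridEmbedding(nn.Module):
        """Embeds a variable number of reference frames into one identity code
        (a stand-in for the paper's dynamic reference-image embedding)."""
        def __init__(self, dim=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
                nn.Conv2d(64, dim, 4, 2, 1), nn.AdaptiveAvgPool2d(1),
            )

        def forward(self, refs):                     # refs: (B, K, 3, H, W)
            b, k = refs.shape[:2]
            feats = self.encoder(refs.flatten(0, 1)).view(b, k, -1)
            return feats.mean(dim=1)                 # simple average over the K references

    class TalkingHeadGenerator(nn.Module):
        def __init__(self, audio_dim=128, dim=256):
            super().__init__()
            self.embed = HybridEmbedding(dim)
            # GRU producing explicit per-frame head-motion/expression codes from audio.
            self.motion = nn.GRU(audio_dim, dim, batch_first=True)
            # Flat decoder standing in for 3D-aware generation + non-linear composition.
            self.compose = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(),
                nn.Linear(dim, 3 * 64 * 64), nn.Tanh(),
            )

        def forward(self, audio, refs):              # audio: (B, T, audio_dim)
            id_code = self.embed(refs)               # (B, dim)
            motion, _ = self.motion(audio)           # (B, T, dim)
            t = motion.size(1)
            fused = torch.cat(
                [motion, id_code.unsqueeze(1).expand(-1, t, -1)], dim=-1)
            return self.compose(fused).view(-1, t, 3, 64, 64)

    # Smoke test: 2 clips, 16 audio steps, 4 reference frames each.
    gen = TalkingHeadGenerator()
    frames = gen(torch.randn(2, 16, 128), torch.randn(2, 4, 3, 64, 64))
    print(frames.shape)  # torch.Size([2, 16, 3, 64, 64])

The key structural point the sketch mirrors is the separation of concerns: identity comes only from the reference embedding, while head motion and expression are predicted per frame from audio, which is what makes the head movement controllable rather than baked into a single driving frame.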



Information

Published In

Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX
Aug 2020
860 pages
ISBN:978-3-030-58544-0
DOI:10.1007/978-3-030-58545-7

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 23 August 2020

Qualifiers

  • Article


Cited By

  • (2024) Multimodal Fusion for Talking Face Generation Utilizing Speech-Related Facial Action Units. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1-24. DOI: 10.1145/3672565. Online publication date: 17-Jun-2024.
  • (2023) Autoregressive GAN for Semantic Unconditional Head Motion Generation. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3635154. Online publication date: 6-Dec-2023.
  • (2023) Learning and Evaluating Human Preferences for Conversational Head Generation. Proceedings of the 31st ACM International Conference on Multimedia, 9615-9619. DOI: 10.1145/3581783.3612831. Online publication date: 26-Oct-2023.
  • (2023) MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model. Proceedings of the 31st ACM International Conference on Multimedia, 6734-6743. DOI: 10.1145/3581783.3612123. Online publication date: 26-Oct-2023.
  • (2023) Talking Face Generation via Facial Anatomy. ACM Transactions on Multimedia Computing, Communications, and Applications 19(3), 1-19. DOI: 10.1145/3571746. Online publication date: 25-Feb-2023.
  • (2023) Emotional Voice Puppetry. IEEE Transactions on Visualization and Computer Graphics 29(5), 2527-2535. DOI: 10.1109/TVCG.2023.3247101. Online publication date: 1-May-2023.
  • (2023) Glitch in the matrix. Computer Vision and Image Understanding 236(C). DOI: 10.1016/j.cviu.2023.103818. Online publication date: 1-Nov-2023.
  • (2022) Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation. Proceedings of the ACM on Computer Graphics and Interactive Techniques 5(1), 1-15. DOI: 10.1145/3522615. Online publication date: 4-May-2022.
  • (2022) Is More Realistic Better? A Comparison of Game Engine and GAN-based Avatars for Investigative Interviews of Children. Proceedings of the 3rd ACM Workshop on Intelligent Cross-Data Analysis and Retrieval, 41-49. DOI: 10.1145/3512731.3534209. Online publication date: 27-Jun-2022.
  • (2022) Perceptual Conversational Head Generation with Regularized Driver and Enhanced Renderer. Proceedings of the 30th ACM International Conference on Multimedia, 7050-7054. DOI: 10.1145/3503161.3551577. Online publication date: 10-Oct-2022.
  • Show More Cited By
