Research article | Open access
DOI: 10.1145/3503161.3548002

D2Animator: Dual Distillation of StyleGAN For High-Resolution Face Animation

Published: 10 October 2022

Abstract

Style-based generator architectures (e.g., StyleGAN v1 and v2) have greatly improved the controllability and explainability of Generative Adversarial Networks (GANs). Many researchers have applied pretrained style-based generators to image manipulation and video editing by exploiting the correspondence between linear interpolation in the latent space and semantic transformations in the synthesized image manifold. However, most previous studies focus on manipulating separate, discrete attributes, which is insufficient for animating a still image into videos with complex and diverse poses and expressions. In this work, we devise a dual distillation strategy (D2Animator) for generating high-resolution face animation videos conditioned on the identity and pose of different images. Specifically, we first introduce a Clustering-based Distiller (CluDistiller) that distills diverse interpolation directions in the latent space and synthesizes identity-consistent faces with various poses and expressions, such as blinking, frowning, and looking up/down. We then propose an Augmentation-based Distiller (AugDistiller) that learns to encode arbitrary face deformations as combinations of interpolation directions by training on augmentation samples synthesized by CluDistiller. By assembling the two distillation methods, D2Animator can generate high-resolution face animation videos without training on video sequences. Extensive experiments on self-driving, cross-identity, and sequence-driving tasks demonstrate the superiority of D2Animator over existing StyleGAN manipulation and face animation methods in both generation quality and animation fidelity.
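
The core mechanism the abstract relies on is that a semantic edit corresponds to a linear walk w' = w + a*d in StyleGAN's latent W space, and that a bank of reusable directions d can be distilled by clustering latent offsets. The sketch below illustrates this idea; it is a minimal sketch, not the authors' code, assuming NumPy and scikit-learn and substituting synthetic latent codes for a real pretrained generator. Names such as num_directions and animate are illustrative, and the paper's actual CluDistiller and AugDistiller objectives are not reproduced here.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    latent_dim = 512  # dimensionality of StyleGAN's W space

    # Stand-ins for pairs of latent codes whose rendered faces differ by some
    # deformation (blink, frown, head turn, ...). In practice these offsets
    # would come from a pretrained StyleGAN generator, not random noise.
    w_src = rng.normal(size=(1000, latent_dim))
    w_dst = w_src + rng.normal(scale=0.1, size=(1000, latent_dim))
    offsets = w_dst - w_src

    # Clustering-style distillation: compress the offsets into a small bank
    # of reusable interpolation directions (unit-normalized centroids).
    num_directions = 16  # illustrative choice, not the paper's setting
    kmeans = KMeans(n_clusters=num_directions, n_init=10,
                    random_state=0).fit(offsets)
    directions = kmeans.cluster_centers_
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    def animate(w_identity, coeffs, alpha=3.0):
        """Shift an identity code along a weighted combination of distilled
        directions; decoding the result with the generator yields one frame."""
        return w_identity + alpha * (coeffs @ directions)

    # One frame driven by a sparse mix of two directions; a per-frame sequence
    # of coefficient vectors yields an animation.
    coeffs = np.zeros(num_directions)
    coeffs[[2, 7]] = [0.8, -0.4]
    w_frame = animate(w_src[0], coeffs)
    print(w_frame.shape)  # (512,); G.synthesis(w_frame) would render the image

In this picture, AugDistiller's role would be to learn the mapping from an observed face deformation to the coefficient vector coeffs, so that arbitrary motion from a driving frame can be expressed in the distilled direction basis.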

Supplementary Material

MP4 File (MM2022_camera_ready_d_2_animator_dual_distillation.mp4)
"Presentation video", "High-Resolution Face Animation"

Cited By

  • (2024) On Robust Cross-view Consistency in Self-supervised Monocular Depth Estimation. Machine Intelligence Research, 21(3), 495-513. DOI: 10.1007/s11633-023-1474-0. Online publication date: 21 March 2024.

    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 October 2022

    Author Tags

    1. GANs
    2. face animation
    3. high-resolution image generation

    Qualifiers

    • Research-article

    Funding Sources

    • Science and Technology Innovation 2030 "Brain Science and Brain-like Research" Major Project

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%
