Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3581783.3612831acmconferencesArticle/Chapter ViewAbstractPublication PagesmmConference Proceedingsconference-collections
research-article

Learning and Evaluating Human Preferences for Conversational Head Generation

Published: 27 October 2023 Publication History

Abstract

A reliable and comprehensive evaluation metric that aligns with manual preference assessments is crucial for conversational head video synthesis methods development. Existing quantitative evaluations often fail to capture the full complexity of human preference, as they only consider limited evaluation dimensions. Qualitative evaluations and user studies offer a solution but are time-consuming and labor-intensive. This limitation hinders the advancement of conversational head generation algorithms and systems. In this paper, we propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions. PS can serve as a quantitative evaluation without the need for human annotation. Experimental results validate the superiority of Preference Score in aligning with human perception, and also demonstrate robustness and generalizability to unseen data, making it a valuable tool for advancing conversation head generation. We expect this metric could facilitate new advances in conversational head generation. Project page: https://github.com/dc3ea9f/PreferenceScore.

References

[1]
Madhav Agarwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. 2023. Audio-visual face reenactment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5178--5187.
[2]
Christopher Burges, Robert Ragno, and Quoc Le. 2006. Learning to rank with nonsmooth cost functions. Advances in neural information processing systems, Vol. 19 (2006).
[3]
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning. 89--96.
[4]
Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. 2020. Talking-head generation with rhythmic head motion. In European Conference on Computer Vision. Springer, 35--51.
[5]
Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).
[6]
Chenpng Du, Qi Chen, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, and Jiang Bian. 2023. DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder. arXiv preprint arXiv:2303.17550 (2023).
[7]
Guy Gafni, Justus Thies, Michael Zollhofer, and Matthias Nießner. 2021. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8649--8658.
[8]
Sahil Goyal, Shagun Uppal, Sarthak Bhagat, Yi Yu, Yifang Yin, and Rajiv Ratn Shah. 2023. Emotionally Enhanced Talking Face Generation. arXiv preprint arXiv:2303.11548 (2023).
[9]
Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. 2022. Neural head avatars from monocular rgb videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18653--18664.
[10]
Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, et al. 2023. StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1505--1515.
[11]
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[12]
Xinya Ji, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Wayne Wu, Feng Xu, and Xun Cao. 2022. Eamm: One-shot emotional talking face via audio-based emotion-aware motion model. In ACM SIGGRAPH 2022 Conference Proceedings. 1--10.
[13]
Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. 2022. Realistic one-shot mesh-based head avatars. In European Conference on Computer Vision. Springer, 345--362.
[14]
Borong Liang, Yan Pan, Zhizhi Guo, Hang Zhou, Zhibin Hong, Xiaoguang Han, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. 2022. Expressive talking head generation with granular audio-visual control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3387--3396.
[15]
Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. 2022. Semantic-aware implicit neural audio-driven video portrait generation. In European Conference on Computer Vision. Springer, 106--125.
[16]
Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10). 807--814.
[17]
Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. 2022. Learning to listen: Modeling non-deterministic dyadic facial motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20395--20405.
[18]
KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. 2020. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM international conference on multimedia. 484--492.
[19]
Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems, Vol. 31 (2018).
[20]
Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. 2023. DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1982--1991.
[21]
Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zike ba, Stavros Petridis, and Maja Pantic. 2023. Diffused heads: Diffusion models beat gans on talking-face generation. arXiv preprint arXiv:2301.03396 (2023).
[22]
Jingxiang Sun, Xuan Wang, Lizhen Wang, Xiaoyu Li, Yong Zhang, Hongwen Zhang, and Yebin Liu. 2023. Next3d: Generative neural texture rasterization for 3d-aware head avatars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20991--21002.
[23]
Junshu Tang, Bo Zhang, Binxin Yang, Ting Zhang, Dong Chen, Lizhuang Ma, and Fang Wen. 2022. Explicitly controllable 3d-aware portrait generation. arXiv preprint arXiv:2209.05434 (2022).
[24]
Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2020. Realistic speech-driven facial animation with gans. International Journal of Computer Vision, Vol. 128 (2020), 1398--1413.
[25]
Jiadong Wang, Xinyuan Qian, Malu Zhang, Robby T Tan, and Haizhou Li. 2023. Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14653--14662.
[26]
Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, and Xin Yu. 2021. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. arXiv preprint arXiv:2107.09293 (2021).
[27]
Shunyu Yao, RuiZhe Zhong, Yichao Yan, Guangtao Zhai, and Xiaokang Yang. 2022. DFA-NeRF: Personalized talking head generation via disentangled face attributes neural rendering. arXiv preprint arXiv:2201.00791 (2022).
[28]
Yu Yin, Kamran Ghasedi, HsiangTao Wu, Jiaolong Yang, Xin Tong, and Yun Fu. 2023. NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8539--8548.
[29]
Bohan Zeng, Boyu Liu, Hong Li, Xuhui Liu, Jianzhuang Liu, Dapeng Chen, Wei Peng, and Baochang Zhang. 2022. FNeVR: Neural volume rendering for face animation. Advances in Neural Information Processing Systems, Vol. 35 (2022), 22451--22462.
[30]
Chenxu Zhang, Yifan Zhao, Yifei Huang, Ming Zeng, Saifeng Ni, Madhukar Budagavi, and Xiaohu Guo. 2021. Facial: Synthesizing dynamic talking face with implicit attribute learning. In Proceedings of the IEEE/CVF international conference on computer vision. 3867--3876.
[31]
Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C Bühler, Xu Chen, Michael J Black, and Otmar Hilliges. 2022. Im avatar: Implicit morphable head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13545--13555.
[32]
Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4176--4186.
[33]
Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, and Tiejun Zhao. 2023 a. Interactive Conversational Head Generation. arxiv: 2307.02090 [cs.CV]
[34]
Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, and Tao Mei. 2022. Responsive Listening Head Generation: A Benchmark Dataset and Baseline. In Proceedings of the European conference on computer vision (ECCV).
[35]
Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, and Tao Mei. 2023 b. Visual-Aware Text-to-Speech*. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1--5. https://doi.org/10.1109/ICASSP49357.2023.10095084

Cited By

View all
  • (2023)Improvements on SadTalker-based Approach for ViCo Conversational Head Generation ChallengeProceedings of the 31st ACM International Conference on Multimedia10.1145/3581783.3612866(9566-9570)Online publication date: 26-Oct-2023

Index Terms

  1. Learning and Evaluating Human Preferences for Conversational Head Generation

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. conversational head generation
    2. digital human
    3. evaluation metric

    Qualifiers

    • Research-article

    Conference

    MM '23
    Sponsor:
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 995 of 4,171 submissions, 24%

    Upcoming Conference

    MM '24
    The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne , VIC , Australia

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)197
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 03 Oct 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Improvements on SadTalker-based Approach for ViCo Conversational Head Generation ChallengeProceedings of the 31st ACM International Conference on Multimedia10.1145/3581783.3612866(9566-9570)Online publication date: 26-Oct-2023

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media