DOI: 10.1145/3581783.3611705

Cultural Self-Adaptive Multimodal Gesture Generation Based on Multiple Culture Gesture Dataset

Published: 27 October 2023

Abstract

Co-speech gesture generation is essential for multimodal chatbots and agents. Prior work has extensively studied the relationships among text, audio, and gesture. To support cross-cultural communication, however, chatbots must also learn culture-specific gestures so that they can capture cultural differences and incorporate cultural cues. Culture-specific gesture generation faces two challenges: the lack of a large-scale, high-quality gesture dataset covering diverse cultural groups, and the lack of generalization across cultures. In this paper, we therefore first introduce the Multiple Culture Gesture Dataset (MCGD), the largest freely available gesture dataset to date, covering ten cultures, over 200 speakers, and 10,000 segmented sequences. We further propose a Cultural Self-adaptive Gesture Generation Network (CSGN) that accounts for multimodal relationships while generating gestures, using a cascade architecture and learnable dynamic weights. The CSGN adaptively generates gestures with different cultural characteristics without retraining a new network: it extracts cultural features either from the multimodal inputs themselves or from a cultural style embedding space given a designated culture. We evaluate our method broadly across four large-scale benchmark datasets. Empirical results show that our method generates gestures for multiple cultures and makes more comprehensive use of the multimodal inputs, improving the state-of-the-art average Fréchet Gesture Distance (FGD) from 53.7 to 48.0 and the culture deception rate (CDR) from 33.63% to 39.87%.
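
The headline metric, Fréchet Gesture Distance (FGD), adapts the Fréchet Inception Distance from image synthesis to latent gesture features, so lower is better. The minimal sketch below is an illustration of how such a score is typically computed, not the paper's implementation: it assumes real_feats and gen_feats are (N, D) NumPy arrays of features from some pretrained gesture feature extractor (not shown), and the function name is ours.

import numpy as np
from scipy import linalg

def frechet_gesture_distance(real_feats, gen_feats):
    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; discard small
    # imaginary parts introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # Frechet distance between the two Gaussians:
    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * sqrt(cov_r cov_g))
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r + cov_g - 2.0 * covmean))

Called as frechet_gesture_distance(real_feats, gen_feats), this returns a scalar playing the role of the reported FGD; the paper's exact feature extractor and normalization details may differ.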

Cited By

• (2024) Emotional Speech-Driven 3D Body Animation via Disentangled Latent Diffusion. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1942-1953. DOI: 10.1109/CVPR52733.2024.00190. Online publication date: 16 June 2024.

Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 27 October 2023

Author Tags

1. co-speech gesture generation
2. datasets
3. evaluation metric
4. multimodal chatbots
5. nonverbal behavior

Qualifiers

• Research-article

Funding Sources

• The Ng Teng Fong Charitable Foundation, in the form of a ZJU-SUTD IDEA Grant

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

Article Metrics

• Downloads (last 12 months): 133
• Downloads (last 6 weeks): 16

Reflects downloads up to 26 January 2025.
