DOI: 10.1145/3678957.3685729
Research article · Open access

Multimodal User Enjoyment Detection in Human-Robot Conversation: The Power of Large Language Models

Published: 04 November 2024

Abstract

Enjoyment is a crucial yet complex indicator of positive user experience in Human-Robot Interaction (HRI). While manual enjoyment annotation is feasible, developing reliable automatic detection methods remains a challenge. This paper investigates a multimodal approach to automatic enjoyment annotation for HRI conversations, leveraging large language models (LLMs), visual, audio, and temporal cues. Our findings demonstrate that both text-only and multimodal LLMs with carefully designed prompts can achieve performance comparable to human annotators in detecting user enjoyment. Furthermore, results reveal a stronger alignment between LLM-based annotations and user self-reports of enjoyment compared to human annotators. While multimodal supervised learning techniques did not improve all of our performance metrics, they could successfully replicate human annotators and highlighted the importance of visual and audio cues in detecting subtle shifts in enjoyment. This research demonstrates the potential of LLMs for real-time enjoyment detection, paving the way for adaptive companion robots that can dynamically enhance user experiences.

Supplemental Material

PDF file: the full prompt used in the paper's experiments.

Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024, 725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. Affect Recognition
  2. Human-Robot Interaction
  3. Large Language Models
  4. Multimodal
  5. Older Adults
  6. User Enjoyment

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Swedish Research Council
  • Digital Futures

Conference

ICMI '24: International Conference on Multimodal Interaction
November 4–8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)
