DOI: 10.1145/3678957.3685729
Research article · Open access

Multimodal User Enjoyment Detection in Human-Robot Conversation: The Power of Large Language Models

Published: 04 November 2024

Abstract

Enjoyment is a crucial yet complex indicator of positive user experience in Human-Robot Interaction (HRI). While manual enjoyment annotation is feasible, developing reliable automatic detection methods remains a challenge. This paper investigates a multimodal approach to automatic enjoyment annotation for HRI conversations, leveraging large language models (LLMs), visual, audio, and temporal cues. Our findings demonstrate that both text-only and multimodal LLMs with carefully designed prompts can achieve performance comparable to human annotators in detecting user enjoyment. Furthermore, results reveal a stronger alignment between LLM-based annotations and user self-reports of enjoyment compared to human annotators. While multimodal supervised learning techniques did not improve all of our performance metrics, they could successfully replicate human annotators and highlighted the importance of visual and audio cues in detecting subtle shifts in enjoyment. This research demonstrates the potential of LLMs for real-time enjoyment detection, paving the way for adaptive companion robots that can dynamically enhance user experiences.

Supplemental Material

PDF file: the full prompt used in the paper's experiments.

Published In

ICMI '24: Proceedings of the 26th International Conference on Multimodal Interaction
November 2024, 725 pages
ISBN: 9798400704628
DOI: 10.1145/3678957
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. Affect Recognition
  2. Human-Robot Interaction
  3. Large Language Models
  4. Multimodal
  5. Older Adults
  6. User Enjoyment

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Swedish Research Council
  • Digital Futures

Conference

ICMI '24: International Conference on Multimodal Interaction
November 4–8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)
