DOI: https://doi.org/10.1145/3686215.3688382

Detecting when Users Disagree with Generated Captions

Published: 04 November 2024

Abstract

The pervasive integration of artificial intelligence (AI) into daily life has led to a growing interest in AI agents that can learn continuously. Interactive Machine Learning (IML) has emerged as a promising approach to meet this need, essentially involving human experts in the model training process, often through iterative user feedback. However, repeated feedback requests can lead to frustration and reduced trust in the system. Hence, there is increasing interest in refining how these systems interact with users to ensure efficiency without compromising user experience. Our research investigates the potential of eye tracking data as an implicit feedback mechanism to detect user disagreement with AI-generated captions in image captioning systems. We conducted a study with 30 participants using a simulated captioning interface and gathered their eye movement data as they assessed caption accuracy. The goal of the study was to determine whether eye tracking data can predict user agreement or disagreement effectively, thereby strengthening IML frameworks. Our findings reveal that, while eye tracking shows promise as a valuable feedback source, ensuring consistent and reliable model performance across diverse users remains a challenge.
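The abstract describes classifying agreement versus disagreement from eye movements recorded while participants judged captions, but does not spell out the feature set or model here. The following is a minimal, hypothetical sketch of such a pipeline, assuming simple per-trial fixation features (count, duration, dispersion), a scikit-learn random forest, and a leave-one-participant-out evaluation, which is one way to expose the cross-user variability the abstract reports. All names, features, and data below are illustrative placeholders, not the authors' implementation.

```python
# Sketch (not the authors' method): predict caption agreement/disagreement
# from per-trial gaze features, evaluated with leave-one-participant-out
# cross-validation to probe generalisation across users.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score


def trial_features(fixations):
    """Aggregate a list of (duration_ms, x, y) fixations into one feature vector."""
    durations = np.array([f[0] for f in fixations], dtype=float)
    xs = np.array([f[1] for f in fixations], dtype=float)
    ys = np.array([f[2] for f in fixations], dtype=float)
    return np.array([
        len(durations),        # fixation count
        durations.mean(),      # mean fixation duration
        durations.sum(),       # total dwell time
        xs.std() + ys.std(),   # rough spatial dispersion of fixations
    ])


# Toy data standing in for the 30-participant study: one feature vector per
# trial, a binary agree/disagree label, and a participant id for grouping.
rng = np.random.default_rng(0)
X = np.vstack([
    trial_features([(rng.uniform(100, 400), rng.uniform(0, 1920), rng.uniform(0, 1080))
                    for _ in range(rng.integers(5, 20))])
    for _ in range(300)
])
y = rng.integers(0, 2, size=300)        # 0 = agree, 1 = disagree
groups = rng.integers(0, 30, size=300)  # participant id

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
print(f"Per-participant accuracy: mean={scores.mean():.2f}, std={scores.std():.2f}")
```

A high standard deviation across the held-out participants in such a setup would mirror the paper's observation that consistent performance across diverse users is difficult to achieve.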

Information

Published In

ICMI Companion '24: Companion Proceedings of the 26th International Conference on Multimodal Interaction
November 2024
252 pages
ISBN: 9798400704635
DOI: 10.1145/3686215

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. disagreement detection
  2. emotion detection
  3. eye tracking
  4. gaze
  5. interactive machine learning
  6. user disagreement

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '24: International Conference on Multimodal Interaction
November 4 - 8, 2024
San Jose, Costa Rica

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
