DOI: 10.1145/3351529.3360661

Fusing Dialogue and Gaze From Discussions of 2D and 3D Scenes

Published: 14 October 2019

Abstract

Conversation partners rely on inference from each other's gaze and utterances to negotiate shared meaning. In contrast, dialogue systems still operate mostly through unimodal question-and-answer or command-and-response interactions. To realize systems that can intuitively discuss and collaborate with humans, we should consider additional sensory information. We begin to address this limitation with an innovative study that acquires, analyzes, and fuses interlocutors' discussion and gaze. Introducing a discussion-based elicitation task, we collect gaze with remote and wearable eye trackers alongside dialogue as interlocutors come to consensus on questions about an on-screen 2D image and a real-world 3D scene. We analyze the resulting visual-linguistic patterns and map both modalities onto the visual environment by extending a multimodal image-region annotation framework that uses statistical machine translation for multimodal fusion, applying three ways of fusing speakers' gaze and discussion.
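
To make the fusion step concrete, below is a minimal sketch of the kind of statistical machine translation alignment the abstract describes: each consensus discussion is treated as a parallel "sentence pair" of spoken words and fixated gaze-region labels, and an IBM Model 1-style EM loop estimates how strongly each word translates to each region. The region labels, the toy data, and the choice of IBM Model 1 as the aligner are illustrative assumptions, not the authors' implementation.

    from collections import defaultdict

    def ibm_model1(pairs, iterations=10):
        """EM for IBM Model 1. pairs: list of (words, regions) token lists.
        Returns translation table t[(word, region)] = P(word | region)."""
        regions = {r for _, rs in pairs for r in rs}
        uniform = 1.0 / len(regions)
        t = defaultdict(lambda: uniform)  # start from a uniform table
        for _ in range(iterations):
            count = defaultdict(float)  # expected word-region co-occurrences
            total = defaultdict(float)  # expected region occurrences
            # E-step: distribute each word's probability mass over the
            # regions fixated during the same discussion.
            for words, regs in pairs:
                for w in words:
                    norm = sum(t[(w, r)] for r in regs)
                    for r in regs:
                        frac = t[(w, r)] / norm
                        count[(w, r)] += frac
                        total[r] += frac
            # M-step: re-estimate the translation probabilities.
            for (w, r), c in count.items():
                t[(w, r)] = c / total[r]
        return t

    # Toy usage with hypothetical region labels: "dog" should align more
    # strongly with the animal region than with the background.
    pairs = [
        (["the", "dog", "is", "brown"], ["animal_region", "background"]),
        (["look", "at", "the", "dog"], ["animal_region"]),
    ]
    t = ibm_model1(pairs)
    print(t[("dog", "animal_region")], t[("dog", "background")])

On this toy data, P("dog" | animal_region) rises above P("dog" | background) after a few EM iterations; this pulling of words toward the regions fixated while they were spoken is the behavior a gaze-dialogue fusion step relies on.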

Cited By

  • Computational framework for fusing eye movements and spoken narratives for image annotation. Journal of Vision 20(7):13 (2020). DOI: 10.1167/jov.20.7.13. Online publication date: 17 July 2020.

Published In

ICMI '19: Adjunct of the 2019 International Conference on Multimodal Interaction
October 2019
86 pages
ISBN:9781450369374
DOI:10.1145/3351529
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 October 2019

Author Tags

  1. 2D and 3D scenes
  2. dialogue
  3. eye movements
  4. gaze
  5. multimodal fusion
  6. spoken discussion

Qualifiers

  • Abstract
  • Research
  • Refereed limited

Conference

ICMI '19

Acceptance Rates

Overall Acceptance Rate: 453 of 1,080 submissions (42%)
