Multimodal conversational interaction with robots

Published: 01 July 2019

Published In

The Handbook of Multimodal-Multisensor Interfaces: Language Processing, Software, Commercialization, and Emerging Directions
July 2019
813 pages
ISBN: 9781970001754
DOI: 10.1145/3233795

Publisher

Association for Computing Machinery and Morgan & Claypool
