research-article

Learning speaker, addressee and overlap detection models from multimodal streams

Authors:

Rich Caruana

ICMI '12: Proceedings of the 14th ACM international conference on Multimodal interaction

Pages 417 - 424

https://doi.org/10.1145/2388676.2388770

Published: 22 October 2012

Abstract

A key challenge in developing conversational systems is fusing streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the location of various people, their focus of attention, body pose, sound source direction, prosody, and speech recognition results. In this paper, we explore discriminative learning techniques for making accurate inferences on the problems of speaker, addressee, and overlap detection in multiparty human-computer dialog. The focus is on finding ways to leverage within- and across-signal temporal patterns and to automatically construct representations from the raw streams that are informative for the inference problem. We present a novel extension to traditional decision trees that allows them to incorporate and model temporal signals. We contrast these methods with more traditional approaches in which a human expert manually engineers relevant temporal features. The proposed approach performs well even with relatively small amounts of training data, which is of practical importance because designing task-dependent features is time-consuming and not always possible.
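The abstract does not specify how the decision trees are extended, so the following Python sketch is only one possible reading of "a tree split that models temporal signals", not the authors' algorithm. It assumes each example is a dictionary of fixed-length signal streams, and that a candidate split aggregates the last few frames of one stream and compares the result to a threshold. The signal names (`audio_energy`, `gaze_at_system`), the window widths, the aggregation functions, and the toy data are all hypothetical.

```python
from statistics import mean


def window_feature(stream, width, agg):
    """Aggregate the last `width` frames of one signal stream."""
    window = stream[-width:]
    if agg == "mean":
        return mean(window)
    if agg == "max":
        return max(window)
    if agg == "delta":
        return window[-1] - window[0]
    raise ValueError(f"unknown aggregation: {agg}")


def gini(labels):
    """Gini impurity for binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_temporal_split(examples, labels, widths=(2, 4),
                        aggs=("mean", "max", "delta")):
    """Search (signal, width, agg, threshold) for the lowest weighted Gini.

    Returns a tuple (score, signal, width, agg, threshold).
    """
    best = None
    n = len(labels)
    for signal in examples[0]:
        for width in widths:
            for agg in aggs:
                values = [window_feature(ex[signal], width, agg)
                          for ex in examples]
                # Every distinct value except the largest is a candidate
                # threshold, so both sides of the split are non-empty.
                for threshold in sorted(set(values))[:-1]:
                    left = [y for v, y in zip(values, labels) if v <= threshold]
                    right = [y for v, y in zip(values, labels) if v > threshold]
                    score = (len(left) * gini(left)
                             + len(right) * gini(right)) / n
                    if best is None or score < best[0]:
                        best = (score, signal, width, agg, threshold)
    return best


# Hypothetical toy data: label 1 means "this person is currently speaking".
examples = [
    {"audio_energy": [0.1, 0.1, 0.8, 0.9], "gaze_at_system": [1, 1, 0, 0]},
    {"audio_energy": [0.2, 0.1, 0.7, 0.8], "gaze_at_system": [0, 1, 1, 0]},
    {"audio_energy": [0.9, 0.8, 0.1, 0.1], "gaze_at_system": [1, 0, 0, 1]},
    {"audio_energy": [0.1, 0.2, 0.1, 0.2], "gaze_at_system": [0, 0, 1, 1]},
]
labels = [1, 1, 0, 0]

best = best_temporal_split(examples, labels)
print(best)
```

In this sketch the window width and aggregation are chosen by the split search itself, whereas a manually engineered pipeline would fix them in advance. A full tree would apply the same search recursively to the two partitions.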

Index Terms

Learning speaker, addressee and overlap detection models from multimodal streams
1. Computing methodologies
  1. Artificial intelligence
    1. Philosophical/theoretical foundations of artificial intelligence
      1. Cognitive science
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Natural language interfaces

Published In

ICMI '12: Proceedings of the 14th ACM international conference on Multimodal interaction

October 2012

636 pages

ISBN:9781450314671

DOI:10.1145/2388676

General Chairs:
Louis-Philippe Morency (University of Southern California, USA), Dan Bohus (Microsoft Research, USA), Hamid Aghajan (Stanford University, USA)

Program Chairs:
Justine Cassell (Carnegie Mellon University, USA), Anton Nijholt (University of Twente, Netherlands), Julien Epps (The University of New South Wales, Australia)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGCHI: ACM Special Interest Group on Computer-Human Interaction

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICMI '12

Sponsor:

SIGCHI

ICMI '12: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION

October 22 - 26, 2012

Santa Monica, California, USA

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%
