DOI: 10.1145/2388676.2388770
research-article

Learning speaker, addressee and overlap detection models from multimodal streams

Published: 22 October 2012

Abstract

A key challenge in developing conversational systems is fusing streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the location of various people, their focus of attention, body pose, the sound source direction, prosody, and speech recognition results. In this paper, we explore discriminative learning techniques for making accurate inferences on the problems of speaker, addressee, and overlap detection in multiparty human-computer dialog. The focus is on finding ways to leverage within- and across-signal temporal patterns and to automatically construct representations from the raw streams that are informative for the inference problem. We present a novel extension to traditional decision trees that allows them to incorporate and model temporal signals. We contrast these methods with more traditional approaches in which a human expert manually engineers relevant temporal features. The proposed approach performs well even with relatively small amounts of training data, which is of practical importance because designing task-dependent features is time-consuming and not always possible.
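
The contrast drawn in the abstract, between hand-engineered temporal features and representations built automatically from raw streams, can be made concrete with a small sketch of the hand-engineered side. The Python below is illustrative only and is not the paper's method: the stream names, the window width, and the three summary statistics (mean, least-squares slope, last value) are assumptions chosen for the example.

```python
from statistics import mean


def window_features(stream, t, width):
    """Summarize the most recent `width` samples of `stream` ending at frame t.

    Hypothetical hand-engineered features; the paper does not specify these.
    """
    start = max(0, t - width + 1)
    window = stream[start:t + 1]
    avg = mean(window)
    n = len(window)
    if n > 1:
        # Least-squares slope of the samples against their frame offset.
        x_bar = (n - 1) / 2
        num = sum((i - x_bar) * (v - avg) for i, v in enumerate(window))
        den = sum((i - x_bar) ** 2 for i in range(n))
        slope = num / den
    else:
        slope = 0.0
    return {"mean": avg, "slope": slope, "last": window[-1]}


def fuse(streams, t, width):
    """Concatenate per-stream window features into one flat feature dict."""
    fused = {}
    for name, stream in streams.items():
        for key, value in window_features(stream, t, width).items():
            fused[f"{name}.{key}"] = value
    return fused


# Two made-up sensor streams sampled at the same frame rate (assumed).
streams = {
    "sound_direction": [0.0, 1.0, 2.0, 3.0],
    "attention_on_system": [1, 1, 0, 0],
}
features = fuse(streams, t=3, width=4)
```

A vector like `features` could then be passed to any off-the-shelf classifier. The cost the abstract points to is visible here: the window width and the choice of statistics have to be picked per task by a person, which is the step the proposed temporal extension to decision trees aims to avoid.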

Published In

ICMI '12: Proceedings of the 14th ACM international conference on Multimodal interaction
October 2012
636 pages
ISBN:9781450314671
DOI:10.1145/2388676
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. addressee detection
  2. learning with multimodal temporally streaming data
  3. multimodal inference
  4. multimodal systems
  5. multiparty turn taking
  6. overlap detection
  7. random forests
  8. speaker identification

Qualifiers

  • Research-article

Conference

ICMI '12: International Conference on Multimodal Interaction
October 22-26, 2012
Santa Monica, California, USA

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 27 Feb 2025.

Cited By

  • (2022) Admitting the addressee detection faultiness of voice assistants to improve the activation performance using a continuous learning framework. Cognitive Systems Research 70:C (65-79). DOI: 10.1016/j.cogsys.2021.07.005. Online publication date: 22-Apr-2022
  • (2020) “Speech Melody and Speech Content Didn’t Fit Together”—Differences in Speech Behavior for Device Directed and Human Directed Interactions. Advances in Data Science: Methodologies and Applications (65-95). DOI: 10.1007/978-3-030-51870-7_4. Online publication date: 27-Aug-2020
  • (2019) Multimodal conversational interaction with robots. The Handbook of Multimodal-Multisensor Interfaces (77-104). DOI: 10.1145/3233795.3233799. Online publication date: 1-Jul-2019
  • (2017) Analysis of Small Groups. Social Signal Processing (349-367). DOI: 10.1017/9781316676202.025. Online publication date: 13-Jul-2017
  • (2016) Real‐Time Coordination in Human‐Robot Interaction Using Face and Voice. AI Magazine 37:4 (19-31). DOI: 10.1609/aimag.v37i4.2686. Online publication date: 1-Dec-2016
  • (2016) Mathematical Model for Processing Multi-user Requests on POMDP Hybrid Dialog Management. Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication (1-4). DOI: 10.1145/2857546.2857650. Online publication date: 4-Jan-2016
  • (2015) A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction. IEEE Transactions on Multimedia 17:9 (1550-1561). DOI: 10.1109/TMM.2015.2454332. Online publication date: Sep-2015
  • (2013) Implementation and evaluation of a multimodal addressee identification mechanism for multiparty conversation systems. Proceedings of the 15th ACM on International conference on multimodal interaction (35-42). DOI: 10.1145/2522848.2522872. Online publication date: 9-Dec-2013
