Abstract
Spoken dialog systems are employed in various devices to help users operate them. An advantage of a spoken dialog system is that the user can make input utterances freely, but this freedom sometimes makes it difficult for the user to speak to the system. The system should estimate the state of a user who encounters a problem when starting a dialog and then give appropriate help before the user abandons the dialog. Based on this assumption, our research aims to construct a system that responds to a user who does not reply to the system's prompt. In this paper, we propose a method of discriminating the user's state based on vector quantization of non-verbal information such as prosodic features, facial feature points, and gaze. The experimental results showed that the proposed method outperformed conventional approaches, achieving a discrimination ratio of 72.0%. We then examined sequential discrimination for responding to the user with appropriate timing. The results indicate that the discrimination ratio reached the same level as that obtained at the end of the session at around 6.0 s.
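The sketch below illustrates, under stated assumptions, how such a cluster-based discrimination pipeline could be organized: frame-level non-verbal feature vectors are vector-quantized with a k-means codebook, each session is summarized as a histogram of codewords, and the histogram is classified with an SVM. This is not the authors' implementation; the feature dimensions, cluster count, labels, and synthetic data are illustrative only.

```python
# Illustrative sketch (not the authors' code) of cluster-based user-state
# discrimination via vector quantization of non-verbal features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Assume each session is a sequence of frame-level non-verbal feature vectors
# (e.g. prosody + facial feature points + gaze); here simulated as random data.
def make_session(label, n_frames=200, dim=20):
    shift = 0.5 if label == 1 else -0.5
    return rng.normal(shift, 1.0, size=(n_frames, dim))

labels = rng.integers(0, 2, size=60)            # 0 = "thinking", 1 = "embarrassed"
sessions = [make_session(y) for y in labels]

# 1) Learn a codebook by k-means over all frames (vector quantization).
codebook = KMeans(n_clusters=32, n_init=10, random_state=0)
codebook.fit(np.vstack(sessions))

# 2) Represent each session as a normalized histogram of codeword counts.
def quantize(session):
    codes = codebook.predict(session)
    hist = np.bincount(codes, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

X = np.array([quantize(s) for s in sessions])

# 3) Discriminate the user's state with an SVM on the histograms.
clf = SVC(kernel="rbf").fit(X[:40], labels[:40])
print("held-out accuracy:", clf.score(X[40:], labels[40:]))

# Sequential discrimination can be approximated by quantizing only the frames
# observed up to time t and classifying the resulting partial histogram.
```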
Acknowledgements
Funding was provided by a Grant-in-Aid for JSPS Research Fellows (Grant No. 263989) and a Grant-in-Aid for Scientific Research (Grant No. JP15H02720).
Cite this article
Chiba, Y., Nose, T. & Ito, A. Cluster-based approach to discriminate the user’s state whether a user is embarrassed or thinking to an answer to a prompt. J Multimodal User Interfaces 11, 185–196 (2017). https://doi.org/10.1007/s12193-017-0238-y