Abstract
Spoken dialog systems are employed in various devices to help users operate them. An advantage of a spoken dialog system is that the user can make input utterances freely, but this freedom sometimes makes it difficult for the user to speak to the system. The system should estimate the state of a user who encounters a problem when starting a dialog and then give appropriate help before the user abandons the dialog. Based on this assumption, our research aims to construct a system that responds to a user who does not reply to the system's prompt. In this paper, we propose a method of discriminating the user's state based on vector quantization of non-verbal information such as prosodic features, facial feature points, and gaze. The experimental results showed that the proposed method outperformed conventional approaches, achieving a discrimination ratio of 72.0%. We then examined sequential discrimination for responding to the user with appropriate timing. The results indicate that the discrimination ratio reached the same level as that obtained at the end of the session at around 6.0 s.
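The sketch below illustrates, under stated assumptions, how such a cluster-based discrimination pipeline could be organized: frame-level non-verbal feature vectors are vector-quantized with a k-means codebook, each session is summarized as a histogram of codewords, and the histogram is classified with an SVM. This is not the authors' implementation; the feature dimensions, cluster count, labels, and synthetic data are illustrative only.

```python
# Illustrative sketch (not the authors' code) of cluster-based user-state
# discrimination via vector quantization of non-verbal features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Assume each session is a sequence of frame-level non-verbal feature vectors
# (e.g. prosody + facial feature points + gaze); here simulated as random data.
def make_session(label, n_frames=200, dim=20):
    shift = 0.5 if label == 1 else -0.5
    return rng.normal(shift, 1.0, size=(n_frames, dim))

labels = rng.integers(0, 2, size=60)            # 0 = "thinking", 1 = "embarrassed"
sessions = [make_session(y) for y in labels]

# 1) Learn a codebook by k-means over all frames (vector quantization).
codebook = KMeans(n_clusters=32, n_init=10, random_state=0)
codebook.fit(np.vstack(sessions))

# 2) Represent each session as a normalized histogram of codeword counts.
def quantize(session):
    codes = codebook.predict(session)
    hist = np.bincount(codes, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

X = np.array([quantize(s) for s in sessions])

# 3) Discriminate the user's state with an SVM on the histograms.
clf = SVC(kernel="rbf").fit(X[:40], labels[:40])
print("held-out accuracy:", clf.score(X[40:], labels[40:]))

# Sequential discrimination can be approximated by quantizing only the frames
# observed up to time t and classifying the resulting partial histogram.
```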
Acknowledgements
Funding was provided by a Grant-in-Aid for JSPS Research Fellows (Grant No. 263989) and a Grant-in-Aid for Scientific Research (Grant No. JP15H02720).
Cite this article
Chiba, Y., Nose, T. & Ito, A. Cluster-based approach to discriminate the user’s state whether a user is embarrassed or thinking to an answer to a prompt. J Multimodal User Interfaces 11, 185–196 (2017). https://doi.org/10.1007/s12193-017-0238-y