
Articulation constrained learning with application to speech emotion recognition

Published: 01 December 2019

Abstract

Speech emotion recognition methods that combine articulatory information with acoustic features have previously been shown to improve recognition performance. However, collecting articulatory data on a large scale is often infeasible, which restricts the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition using both articulatory and acoustic information is proposed. A traditional ℓ1-regularized logistic regression cost function is extended with additional constraints that force the model to reconstruct articulatory data, yielding sparse, interpretable representations jointly optimized for both tasks. Furthermore, articulatory features are required only during training; inference on out-of-sample data uses speech features alone. Experiments evaluate emotion recognition performance on the vowels /AA/, /AE/, /IY/, and /UW/ and on complete utterances. Incorporating articulatory information is shown to significantly improve valence-based classification. Results for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.
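The abstract describes the formulation only at a high level. The Python sketch below illustrates one way such an articulation-constrained objective could look, assuming the constraint is imposed as a soft reconstruction penalty added to an ℓ1-regularized logistic loss; the names (`joint_objective`, `lam_l1`, `gamma`, `W_rec`) and the exact form of the penalty are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def joint_objective(w, W_rec, X, y, A, lam_l1=0.1, gamma=1.0):
    """Illustrative joint cost (a sketch, not the paper's exact objective):
    L1-regularized logistic loss on acoustic features X with labels y in
    {-1, +1}, plus a squared-error term asking a linear map of the same
    acoustic features to reconstruct articulatory trajectories A."""
    z = X @ w
    logistic = np.mean(np.log1p(np.exp(-y * z)))   # binary logistic loss
    sparsity = lam_l1 * np.sum(np.abs(w))          # L1 term -> sparse, interpretable weights
    recon = gamma * np.mean((X @ W_rec - A) ** 2)  # soft articulatory reconstruction constraint
    return logistic + sparsity + recon

# Toy usage: articulatory data A is needed only to evaluate/optimize this cost
# during training; prediction on new speech uses the acoustic weights w alone.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))    # acoustic feature vectors
y = rng.choice([-1, 1], size=50)     # binary emotion labels (e.g., valence)
A = rng.standard_normal((50, 6))     # articulatory trajectories (training only)
w = 0.01 * rng.standard_normal(20)
W_rec = 0.01 * rng.standard_normal((20, 6))
print(joint_objective(w, W_rec, X, y, A))
```

Because the articulatory term appears only in the training objective, dropping it at test time leaves a standard sparse logistic regression predictor, consistent with the abstract's claim that inference requires speech features only.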



Published In

EURASIP Journal on Audio, Speech, and Music Processing, Volume 2019, Issue 1
Dec 2019, 399 pages
ISSN: 1687-4714
EISSN: 1687-4722

Publisher

Hindawi Limited

London, United Kingdom

Author Tags

  1. Emotion recognition
  2. Articulation
  3. Constrained optimization
  4. Cross-corpus

Qualifiers

  • Research-article
