
Articulation constrained learning with application to speech emotion recognition

Published: 01 December 2019

Abstract

Speech emotion recognition methods that combine articulatory information with acoustic features have previously been shown to improve recognition performance. However, collecting articulatory data on a large scale is often infeasible, which restricts the scope and applicability of such methods. In this paper, a discriminative learning method for emotion recognition using both articulatory and acoustic information is proposed. A traditional ℓ1-regularized logistic regression cost function is extended with additional constraints that force the model to reconstruct articulatory data, yielding sparse, interpretable representations jointly optimized for both tasks. Furthermore, articulatory features are required only during training; inference on out-of-sample data uses speech features alone. Experiments evaluate emotion recognition performance on the vowels /AA/, /AE/, /IY/, and /UW/ and on complete utterances. Incorporating articulatory information is shown to significantly improve valence-based classification. Results for within-corpus and cross-corpus categorical emotion recognition indicate that the proposed method is more effective at distinguishing happiness from other emotions.
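The abstract describes the formulation only at a high level. The Python sketch below illustrates one way such an articulation-constrained objective could look, assuming the constraint is imposed as a soft reconstruction penalty added to an ℓ1-regularized logistic loss; the names (`joint_objective`, `lam_l1`, `gamma`, `W_rec`) and the exact form of the penalty are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def joint_objective(w, W_rec, X, y, A, lam_l1=0.1, gamma=1.0):
    """Illustrative joint cost (a sketch, not the paper's exact objective):
    L1-regularized logistic loss on acoustic features X with labels y in
    {-1, +1}, plus a squared-error term asking a linear map of the same
    acoustic features to reconstruct articulatory trajectories A."""
    z = X @ w
    logistic = np.mean(np.log1p(np.exp(-y * z)))   # binary logistic loss
    sparsity = lam_l1 * np.sum(np.abs(w))          # L1 term -> sparse, interpretable weights
    recon = gamma * np.mean((X @ W_rec - A) ** 2)  # soft articulatory reconstruction constraint
    return logistic + sparsity + recon

# Toy usage: articulatory data A is needed only to evaluate/optimize this cost
# during training; prediction on new speech uses the acoustic weights w alone.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))    # acoustic feature vectors
y = rng.choice([-1, 1], size=50)     # binary emotion labels (e.g., valence)
A = rng.standard_normal((50, 6))     # articulatory trajectories (training only)
w = 0.01 * rng.standard_normal(20)
W_rec = 0.01 * rng.standard_normal((20, 6))
print(joint_objective(w, W_rec, X, y, A))
```

Because the articulatory term appears only in the training objective, dropping it at test time leaves a standard sparse logistic regression predictor, consistent with the abstract's claim that inference requires speech features only.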



Published In

EURASIP Journal on Audio, Speech, and Music Processing, Volume 2019, Issue 1
Dec 2019, 399 pages
ISSN: 1687-4714
EISSN: 1687-4722

Publisher

Hindawi Limited

London, United Kingdom

Author Tags

  1. Emotion recognition
  2. Articulation
  3. Constrained optimization
  4. Cross-corpus

Qualifiers

  • Research-article
