Towards Domain-Specific Cross-Corpus Speech Emotion Recognition Approach
Abstract
Cross-corpus speech emotion recognition (SER) poses a challenge due to feature distribution mismatch, potentially degrading the performance of established SER methods. In this paper, we tackle this challenge by proposing a novel transfer subspace learning method called acoustic knowledge-guided transfer linear regression (AKTLR). Unlike existing approaches, which often overlook domain-specific knowledge related to SER and simply treat cross-corpus SER as a generic transfer learning task, our AKTLR method is built upon a well-designed acoustic knowledge-guided dual sparsity constraint mechanism. This mechanism builds on empirically validated acoustic knowledge in SER: minimalistic acoustic parameter feature sets can alleviate classifier over-adaptation and thus generalize better in cross-corpus SER tasks than large feature sets. Through this mechanism, we extend a simple transfer linear regression model to AKTLR. This extension harnesses its full capability to seek emotion-discriminative and corpus-invariant features from established acoustic parameter feature sets used for describing speech signals across two scales: contributive acoustic parameter groups and constituent elements within each contributive group. Our proposed method is evaluated through extensive cross-corpus SER experiments on three widely-used speech emotion corpora: EmoDB, eNTERFACE, and CASIA. The results confirm the effectiveness and superior performance of our method, outperforming recent state-of-the-art transfer subspace learning and deep transfer learning-based cross-corpus SER methods. Furthermore, our work provides experimental evidence supporting the feasibility and superiority of incorporating domain-specific knowledge into the transfer learning model to address cross-corpus SER tasks.
Index Terms:
Cross-corpus speech emotion recognition, speech emotion recognition, transfer subspace learning, domain adaptation, domain-specific knowledge.
I Introduction
Speech plays a crucial role in human daily communication, serving as a natural means for individuals to express their emotions such as Happiness, Fear, and Sadness. As a result, the research of speech emotion recognition (SER) [1, 2, 3], which seeks to empower computers to automatically understand emotional states from speech signals, holds significant practical value. Over the past few decades, SER has garnered substantial attention within the communities of human-computer interaction, affective computing, and signal processing, leading to the development of numerous well-performing SER methods [4, 5, 6, 7, 8, 9].
However, it is important to note that most established SER methods, including those mentioned above, primarily focus on an ideal scenario where the training and testing speech signals belong to the same speech emotion corpus. In practical situations, the testing speech signals may differ significantly from the training speech signals, exhibiting variations in numerous factors, such as languages, recording equipment, and environmental conditions. This gives rise to a challenging but intriguing task known as cross-corpus SER [10] within the field of SER. In cross-corpus SER tasks, the training and testing speech signals originate from different speech emotion corpora and can be referred to as the source and target signals, respectively. Moreover, while we have access to ground truth emotion labels for the source speech samples, the target speech emotion corpus remains entirely unlabeled.
In the early stages, research on cross-corpus SER mostly focused on feature engineering, aiming to enhance the corpus-invariant ability of acoustic parameter feature sets used to describe speech signals. For example, in the work of [10], three feature normalization schemes, including corpus normalization, speaker normalization, and speaker-corpus normalization, are designed to address feature distribution mismatches between source and target speech emotion corpora. Subsequently, Parlak et al. [11] attempt to use numerous feature selectors, such as linear forward selection, to seek high-quality speech features that are robust to corpus variance from existing comprehensive acoustic feature sets. In recent years, inspired by the tremendous success of transfer learning in various cross-domain recognition tasks [12, 13], researchers have shifted their focus to the development of transfer learning methods for cross-corpus SER. These methods have achieved promising performance in recognizing emotions in speech signals across different corpora, marking a significant advancement in this field.
Broadly, current transfer learning-based cross-corpus SER methods can be classified into two types: Transfer Subspace Learning and Deep Transfer Learning:
(1) Transfer subspace learning-based cross-corpus SER methods typically begin by using a set of acoustic low-level descriptors (LLDs), such as fundamental frequency (F0) and Mel-frequency cepstral coefficients (MFCC), along with their associated statistical functions, such as maximal and mean values, to describe the source and target speech signals. Subsequently, a transfer subspace learning model is developed to mitigate the distribution mismatch between the two feature sets. One early method can be traced back to the work of [14], in which Hassan et al. extend the support vector machine (SVM) [15] to an importance-weighted SVM (IW-SVM) for cross-corpus SER. IW-SVM incorporates three different transfer subspace learning models: kernel mean matching (KMM) [16], unconstrained least-squares importance fitting (uLSIF) [17], and Kullback-Leibler importance estimation procedure (KLIEP) [18]. It can thus learn a set of weights for the source speech samples, ensuring that the weighted source speech feature sets align with the distribution of target speech feature sets. Another notable work is the transfer non-negative matrix factorization (TNNMF) models designed by Song et al. [19]. These models integrate the maximum mean discrepancy (MMD) [20] to measure and minimize the discrepancies between the source and target speech feature distributions. Following this work, Luo et al. [21] further advance the TNNMF model by jointly reducing the marginal and class-aware conditional feature distribution gaps between the two different speech sample sets.
(2) In contrast to transfer subspace learning, deep transfer learning methods often utilize the speech spectrums of the original speech signals as input for deep neural networks, harnessing their powerful nonlinear representation capabilities to learn emotion-discriminative and corpus-invariant features. Parry et al. [22] examine the generalization capacity of deep neural networks for cross-corpus SER across six different speech emotion corpora. Their experimental results demonstrate that convolutional neural networks (CNNs) [23] exhibit superior generalization capabilities compared to recurrent neural networks (RNNs) [24]. Inspired by this observation, Zhao et al. [25] propose deep transductive transfer regression networks (DTTRN) based on CNN architectures. A key contribution of DTTRN is the incorporation of additional fine-grained emotion class-aware conditional MMD, which aids in better bridging the distribution gap between learned source and target features compared to the original MMD. Additionally, Zhao et al. [26] introduce another CNN-based deep transfer learning method called deep implicit distribution alignment neural networks (DIDAN). DIDAN performs implicit distribution alignment for source and target speech corpora by replacing the minimization of MMD with sparsely reconstructing target samples using source samples. More recently, domain-adversarial learning-based models [27, 28, 29] have been developed to learn more generalized representations of speech signals for cross-corpus SER. The key concept behind these methods is the introduction of an additional domain (corpus) classifier, which enables the deep neural networks to learn generalized features to describe speech signals, regardless of their corpus sources.
While both transfer subspace learning and deep transfer learning methods have demonstrated success in addressing the challenge of cross-corpus SER, it is worth noting that these methods often approach cross-corpus SER as a generic transfer learning task. This means that most of these methods focus primarily on developing transfer learning models without specifically considering the valuable acoustic knowledge inherent to SER. As a result, these transfer learning methods can be applied to other cross-domain recognition tasks without making significant modifications. According to the "No Free Lunch Theorem" [30], it is established that "There is no universal learning algorithm that can provide the best solution for every problem. Each algorithm has its strengths and weaknesses, and its performance is highly dependent on the specific problem domain and data distribution." From this perspective, it can be argued that these generic methods may not offer ultimately satisfactory solutions for cross-corpus SER. In other words, incorporating domain-specific knowledge from SER to guide the design of transfer learning models could potentially lead to even better performance compared to the generic transfer learning models when dealing with cross-corpus SER. Therefore, our goal in this paper is to develop a domain-specific transfer learning approach for cross-corpus SER. Specifically, we propose a novel transfer subspace learning method called acoustic knowledge-guided transfer linear regression (AKTLR).
The basic concept behind AKTLR comes from the empirically validated acoustic knowledge about the acoustic parameter feature sets designed for describing speech signals and their cross-corpus recognition performance evaluation within SER [31, 11, 32]. These works indicate that a carefully selected, minimalistic set of high-quality acoustic parameters generalizes better across speech emotion corpora. Therefore, selecting these acoustic parameters may enable the transfer subspace learning models to achieve more promising recognition performance in cross-corpus SER tasks compared to directly using larger feature sets comprising comprehensive acoustic parameters. This insight motivates us to introduce an acoustic knowledge-guided dual sparsity constraint mechanism, illustrated in Fig. 1, to develop the AKTLR model for cross-corpus SER. As depicted in Fig. 1, this mechanism equips the AKTLR model to proficiently discern emotion-discriminative and corpus-invariant features from established acoustic parameter feature sets at both coarse-grained and fine-grained scales. Specifically, it begins by measuring the contribution scores of different acoustic LLDs, and subsequently selects truly contributive derived features from each of the LLD groups with high contribution scores.
To evaluate the effectiveness of AKTLR, we conduct extensive cross-corpus SER experiments using three widely-used speech emotion corpora: EmoDB [33], eNTERFACE [34], and CASIA [35]. The experimental results demonstrate that our AKTLR outperforms recent state-of-the-art transfer subspace learning and deep transfer learning-based cross-corpus SER methods, showcasing the effectiveness of incorporating domain-specific knowledge into transfer subspace learning for cross-corpus SER. In summary, this paper makes three primary contributions:
1. We propose AKTLR, a novel transfer subspace learning method inspired by empirically verified acoustic knowledge, making it the first work to propose a domain-specific approach for cross-corpus SER.
2. We introduce an acoustic knowledge-guided dual sparsity constraint mechanism to guide the design of AKTLR. This mechanism enables AKTLR to effectively seek emotion-discriminative and corpus-invariant features from established acoustic parameter feature sets, operating at two different scales, for cross-corpus SER.
3. We perform extensive cross-corpus SER experiments using three widely-used speech emotion corpora to assess the effectiveness of AKTLR. The experimental results demonstrate the superior performance of AKTLR in addressing the challenge of cross-corpus SER.
II Proposed Method
II-A Notations
In this section, we will provide a detailed description of the proposed AKTLR model and demonstrate how to utilize this model to address cross-corpus SER tasks. Before delving into the model specifics, let us establish a set of notations necessary for constructing the model. Suppose we have a source speech emotion corpus comprising samples, with its feature matrix denoted as . Here, represents the dimension of acoustic parameter feature vector and represents the feature dimension. represents the features derived from an LLD group (one specific LLD or more closely-related LLDs), such as MFCC or energy-based features, within LLD groups used to design acoustic parameter features for describing speech signals. The corresponding emotion label matrix of the source speech samples is expressed as . Each column is a one-hot vector associated with the speech sample. The entry in is set as 1 if the corresponding speech sample expresses the emotion within emotion set , and 0 otherwise. Similarly, the target speech feature matrix can be denoted as , where is the number of samples in the target speech emotion corpus.
II-B AKTLR Model
As previously described and illustrated in Fig. 1, our AKTLR method is designed based on a simple transfer linear regression model and an acoustic knowledge-guided dual sparsity constraint mechanism. This design enables the model to effectively seek high-quality speech features that are emotion-discriminative and corpus-invariant at two scales within a comprehensive feature set consisting of various acoustic parameters and their derived features. This facilitates the connection of emotions expressed in speech signals from different corpora. To achieve this, we design the following optimization problem for AKTLR:
(1) |
In Eq.(1), and represent the loss functions corresponding to a simple Transfer Linear Regression model and the newly designed Acoustic Knowledge-guided Dual Sparsity Constraint Mechanism, respectively. The parameter serves as a trade-off parameter that controls the balance between these two functions. It is important to note that is the regression coefficient matrix to be learned in AKTLR, and is a contribution score vector and also a model parameter of AKTLR. Each entry in this vector, , is a non-negative value and measures the contribution of its corresponding acoustic parameter feature derived from an LLD group, in recognizing emotions across speech corpora. In what follows, we describe the details of the key loss functions in AKTLR.
II-B1 Loss Function for Transfer Linear Regression
The loss function corresponding to transfer linear regression, denoted as , can be formulated as follows:
(2) |
where , is the mean difference between the source and target speech feature vectors associated with the LLD group, and is the trade-off parameter.
The loss function consists of two main terms. The first term, , represents a weighted linear regression function that establishes the relationship between the source speech feature sets and their ground truth emotion labels. Minimizing this term enables the proposed AKTLR to seek a subspace in which different emotions expressed in speech signals can be distinguished. The second term, , measures the distribution gap between the source and target speech feature sets in this subspace using the first-order statistical moment, i.e., the mean value. Minimizing this term encourages the source and target feature sets to have similar feature distributions in this subspace. Thus, the proposed AKTLR is also applicable to distinguishing emotions expressed in target speech signals.
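The two terms described above can be illustrated with a small numerical sketch. It assumes a group-wise layout that is not spelled out in the surviving text: each LLD group has its own source matrix, target matrix, and projection block, and the group contribution scores weight the projected features. All names (`project`, `mean_column`, `tlr_loss`, `lam`) are hypothetical and chosen for illustration only; the squared Frobenius norm for the regression term and the squared Euclidean norm for the mean-gap term are likewise assumptions.

```python
def project(p, x):
    """Return P^T X for P of shape (d, C) and X of shape (d, N), as nested lists."""
    d, n_cls, n_cols = len(x), len(p[0]), len(x[0])
    return [[sum(p[r][c] * x[r][n] for r in range(d)) for n in range(n_cols)]
            for c in range(n_cls)]


def mean_column(x):
    """Mean feature vector of X (d, N), returned as a (d, 1) column."""
    return [[sum(row) / len(row)] for row in x]


def tlr_loss(labels, src_groups, tgt_groups, proj_groups, alpha, lam):
    """Assumed form: regression fit on the source plus lam * projected mean gap.

    labels      : (C, N_s) one-hot label matrix of the source samples
    src_groups  : list of K source feature matrices, each (d_i, N_s)
    tgt_groups  : list of K target feature matrices, each (d_i, N_t)
    proj_groups : list of K projection blocks, each (d_i, C)
    alpha       : list of K non-negative contribution scores
    """
    n_cls, n_src = len(labels), len(labels[0])
    pred = [[0.0] * n_src for _ in range(n_cls)]
    gap = [0.0] * n_cls
    for a, p, xs, xt in zip(alpha, proj_groups, src_groups, tgt_groups):
        z = project(p, xs)
        # Source-minus-target mean of this group, projected into label space.
        diff = [[ms[0] - mt[0]]
                for ms, mt in zip(mean_column(xs), mean_column(xt))]
        g = project(p, diff)
        for c in range(n_cls):
            gap[c] += a * g[c][0]
            for n in range(n_src):
                pred[c][n] += a * z[c][n]
    fit = sum((labels[c][n] - pred[c][n]) ** 2
              for c in range(n_cls) for n in range(n_src))
    return fit + lam * sum(v * v for v in gap)
```

In this toy form, a perfect fit on the source corpus still incurs a penalty whenever the projected source and target means differ, which is the behavior the second term is meant to capture.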
II-B2 Loss Function for Acoustic Knowledge-guided Dual Sparsity Constraint Mechanism
The loss function for the acoustic knowledge-guided dual sparsity constraint mechanism is designed as follows:
(3) |
Here, is the trade-off parameter. This loss function consists of two major terms: the norm with respect to and the norm with respect to . Minimizing this loss function encourages the proposed AKTLR to learn a non-negative sparse and column-sparse . The non-negative sparse allows the AKTLR model to measure the specific contributions of different acoustic parameter features at the coarse-grained scale of LLD groups, suppressing the less-contributive ones while highlighting the highly-contributive ones. Additionally, the column-sparse further enhances AKTLR by performing fine-grained feature selection to suppress low-quality acoustic parameter features derived from LLD groups with high contribution scores.
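To make the two sparsity scales concrete, the sketch below evaluates a penalty of the kind described above. It assumes, since the norm symbols did not survive, that the coarse-grained term is an L1 norm on the non-negative contribution score vector and the fine-grained term is an L2,1 norm computed over the columns of the coefficient matrix. The function names and the parameter `gamma` are hypothetical.

```python
import math


def l1_norm(alpha):
    """Sum of absolute values; promotes sparsity over whole LLD groups."""
    return sum(abs(a) for a in alpha)


def l21_norm(mat):
    """Sum of the Euclidean norms of the columns; promotes column sparsity."""
    rows, cols = len(mat), len(mat[0])
    return sum(math.sqrt(sum(mat[i][j] ** 2 for i in range(rows)))
               for j in range(cols))


def dual_sparsity_penalty(alpha, coef, gamma):
    """Assumed penalty: ||alpha||_1 + gamma * ||coef||_{2,1}, with alpha >= 0."""
    if min(alpha) < 0:
        raise ValueError("contribution scores are required to be non-negative")
    return l1_norm(alpha) + gamma * l21_norm(coef)
```

A column of `coef` that is entirely zero contributes nothing to the L2,1 term, which is why minimizing it tends to switch off individual features as a whole rather than shrinking single entries.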
II-B3 Optimization Problem of AKTLR
By incorporating the formulations of the two loss functions as shown in Eqs.(2) and (3) into Eq.(1), we can derive the ultimate optimization problem for training the proposed AKTLR models, which is expressed as follows:
(4) |
where , , and are the trade-off parameters that control the balance among the key terms in the total loss function of AKTLR.
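As a reading aid, the verbal description in Sections II-B1 to II-B3 is consistent with an objective of the following shape. This is a sketch under stated assumptions rather than the authors' exact formulation: the symbols $\mathbf{L}^s$ (source label matrix), $\mathbf{X}^s_i$ (source features of the $i$-th LLD group), $\mathbf{P}_i$ (regression coefficients of that group), $\alpha_i$ (its contribution score), $\Delta\mathbf{m}_i$ (source-target mean difference of that group), $K$ (number of LLD groups), and the trade-off parameters $\lambda$, $\mu$, $\gamma$ are all notation introduced here, and the exact placement of the three trade-off parameters is assumed.

```latex
\min_{\mathbf{P},\, \boldsymbol{\alpha}}\;
  \Big\| \mathbf{L}^{s} - \sum_{i=1}^{K} \alpha_i \mathbf{P}_i^{\top} \mathbf{X}^{s}_i \Big\|_F^2
  + \lambda \Big\| \sum_{i=1}^{K} \alpha_i \mathbf{P}_i^{\top} \Delta\mathbf{m}_i \Big\|_2^2
  + \mu \, \| \boldsymbol{\alpha} \|_1
  + \gamma \, \| \mathbf{P}^{\top} \|_{2,1}
\quad \text{s.t.} \quad \alpha_i \ge 0,\; i = 1, \dots, K,
\qquad \text{where } \| \mathbf{M} \|_{2,1} = \sum_{j} \| \mathbf{M}_{:,j} \|_2 .
```

The first two terms correspond to the transfer linear regression loss (regression fit and mean-based distribution gap), and the last two to the dual sparsity constraint (group-level and feature-level sparsity).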
II-C Optimization of AKTLR
The optimization problem for training AKTLR, as shown in Eq.(4), can be effectively addressed using the alternating direction method (ADM) [37]. Specifically, the optimal parameters in AKTLR, represented by and , can be obtained through the following iterative steps:
(1) Fix the regression coefficient matrix and update the contribution score vector: In this step, the optimization problem becomes one with respect to the contribution score vector alone, which can be formulated as follows:
(8) |
Let and , where is a vector of all zero values. Then, the optimization problem in Eq.(8) can be rewritten as:
(9) |
Subsequently, let and , where is an operation that reshapes a matrix into a vector column by column. We are thus able to further restate the objective function in Eq.(9) as the following formulation:
(10) |
where . It is apparent that Eq.(10) represents a standard non-negative LASSO problem, and we utilize the SLEP package [38] to solve it.
(2) Fix the contribution score vector and update the regression coefficient matrix: The optimization problem in this step can be formulated as follows:
(11) |
where . We use the inexact augmented Lagrangian multiplier (IALM) method [39] to learn the optimal regression coefficient matrix.
To be specific, an additional variable, satisfying , is introduced to first convert the original unconstrained optimization problem in Eq.(11) to a constrained one, which can be expressed as follows:
(12) |
Subsequently, we are able to obtain the Lagrangian function for Eq.(12), which is formulated as follows:
(13) |
where is the Lagrangian multiplier matrix, represents the trace of a square matrix, and is a relaxation factor.
Finally, the optimal solution of can be obtained by iteratively minimizing the Lagrangian function in Eq.(13) with respect to one of variables while fixing the others. The detailed updating procedures are summarized in Algorithm 1.
(3) Check convergence: the iteration stops when the value of the objective function is less than the machine epsilon or when the number of iterations reaches the preset maximum.
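The two subproblems in the steps above rest on two standard building blocks, sketched here in plain Python. Both are illustrations under assumptions, not the authors' implementation: the paper solves the non-negative LASSO with the SLEP package, whereas `nonneg_lasso` below swaps in a simple projected proximal-gradient loop for the objective 0.5 * ||y - A x||^2 + mu * sum(x) with x >= 0; and `shrink_columns` is the usual proximal operator of the L2,1 norm, which IALM-style schemes commonly use for the auxiliary variable (whether Algorithm 1 uses exactly this update is an assumption). All function and parameter names are hypothetical.

```python
import math


def nonneg_lasso(a, y, mu, n_iter=500):
    """Projected proximal gradient for 0.5*||y - A x||^2 + mu*sum(x), x >= 0.

    a is an (m, n) nested list with at least one non-zero entry, y has length m.
    """
    m, n = len(a), len(a[0])
    # 1 / ||A||_F^2 is a safe step size because ||A||_F^2 >= ||A||_2^2.
    step = 1.0 / sum(v * v for row in a for v in row)
    x = [0.0] * n
    for _ in range(n_iter):
        resid = [sum(a[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        grad = [sum(a[i][j] * resid[i] for i in range(m)) for j in range(n)]
        # Gradient step on the smooth part, then shift by mu and clip at zero.
        x = [max(0.0, x[j] - step * (grad[j] + mu)) for j in range(n)]
    return x


def shrink_columns(mat, tau):
    """Proximal operator of tau * L2,1 norm: shrink each column's length by tau."""
    rows, cols = len(mat), len(mat[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        norm = math.sqrt(sum(mat[i][j] ** 2 for i in range(rows)))
        if norm > tau:
            scale = (norm - tau) / norm
            for i in range(rows):
                out[i][j] = scale * mat[i][j]
    return out
```

An outer loop in the spirit of the alternating scheme would call a solver such as `nonneg_lasso` with the coefficient matrix held fixed, then run the IALM updates with the scores held fixed, and repeat until the stopping rule in step (3) is met.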
II-D Prediction of Emotion Labels for Target Speech Signals
Once we have obtained the optimal solution, and , for AKTLR, we can easily predict the emotion labels of the target speech signals. Let be the feature vector of a target speech sample. We first predict its emotion label vector by solving the following optimization problem:
(14) |
This is a standard quadratic programming problem and can be effectively solved using the interior point method. Then, based on , the emotion label of its corresponding target speech signal can be determined as the emotion, which satisfies the following criterion:
(15) |
where represents the entry in the predicted emotion label vector .
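The final decision rule can be sketched as follows, assuming, as is usual for one-hot label regression, that the criterion in Eq.(15) selects the emotion whose entry in the predicted label vector is largest. The function name and the example emotion list are hypothetical, and the quadratic program of Eq.(14) that produces the label vector is not reproduced here.

```python
def predict_emotion(label_vector, emotions):
    """Return the emotion whose predicted label entry is the largest."""
    if len(label_vector) != len(emotions):
        raise ValueError("label vector and emotion list must have equal length")
    best = max(range(len(emotions)), key=lambda j: label_vector[j])
    return emotions[best]


# Example: the second entry dominates, so the prediction is "Fear".
print(predict_emotion([0.1, 0.7, 0.2], ["Anger", "Fear", "Happiness"]))
```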
III Experiments
III-A Experiment Setup
In this section, we evaluate the performance of the proposed AKTLR method through extensive cross-corpus SER experiments. We provide details of our experiment setup, including: 1) Speech Emotion Corpora, 2) Experimental Protocol, 3) Performance Metric, and 4) Comparison Methods and Implementation Details.
III-A1 Speech Emotion Corpora
We utilize three publicly available speech emotion corpora in our experiments. Here is a brief overview of these corpora:
EmoDB [33]: This German speech emotion corpus consists of 535 speech samples. Each sample corresponds to a sentence uttered in German under one of seven emotional states (Anger, Boredom, Disgust, Fear, Happiness, Neutral, and Sadness) by one of 10 professional German actresses/actors (five actresses and five actors).
eNTERFACE [34]: Unlike EmoDB, eNTERFACE is a bimodal emotion database containing 1,257 video clips with both speech and facial expressions. Each video clip is labeled with one of six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise). For the design of our cross-corpus SER tasks, only the speech data is used.
CASIA [35]: This is a large-scale Chinese speech emotion corpus comprising 9,600 speech samples. In our experiments, we utilize its freely released version, which includes 1,200 speech samples from four speakers (two females and two males), with each speech sample conveying one of six different emotions (Anger, Fear, Happiness, Neutral, Sadness, and Surprise).
III-A2 Experimental Protocol
We use the aforementioned three speech emotion corpora to create six cross-corpus SER tasks: $B \rightarrow E$, $E \rightarrow B$, $B \rightarrow C$, $C \rightarrow B$, $E \rightarrow C$, and $C \rightarrow E$. Here, $B$, $E$, and $C$ represent EmoDB, eNTERFACE, and CASIA, respectively. The corpora on the left and right of the arrow are the source and target speech emotion corpora of the corresponding cross-corpus SER task. It is important to note that due to inconsistencies in emotion labels across the three speech emotion corpora, only speech samples with matching emotion labels are chosen for their corresponding tasks. For a more comprehensive understanding of these cross-corpus SER tasks, detailed data composition for all the speech emotion corpora is presented in Table I.
Emotion | $B \rightarrow E$ / $E \rightarrow B$: EmoDB | $B \rightarrow E$ / $E \rightarrow B$: eNTERFACE | $B \rightarrow C$ / $C \rightarrow B$: EmoDB | $B \rightarrow C$ / $C \rightarrow B$: CASIA | $E \rightarrow C$ / $C \rightarrow E$: eNTERFACE | $E \rightarrow C$ / $C \rightarrow E$: CASIA
Anger | 127 | 211 | 127 | 200 | 211 | 200
Fear | 69 | 211 | 69 | 200 | 211 | 200
Disgust | 46 | 211 | - | - | - | -
Happiness | 71 | 208 | 71 | 200 | 208 | 200
Neutral | - | - | 79 | 200 | - | -
Sadness | 62 | 211 | 62 | 200 | 211 | 200
Surprise | - | - | - | - | 211 | 200
Total Number | 375 | 1,052 | 408 | 1,000 | 1,052 | 1,000
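The label-matching rule of the protocol amounts to intersecting the emotion sets of the two corpora involved in a task. The sketch below uses the emotion inventories listed in the corpus descriptions above; the dictionary and function names are hypothetical.

```python
# Emotion inventories as listed for each corpus (CASIA: freely released version).
CORPUS_EMOTIONS = {
    "EmoDB": {"Anger", "Boredom", "Disgust", "Fear", "Happiness", "Neutral", "Sadness"},
    "eNTERFACE": {"Anger", "Disgust", "Fear", "Happiness", "Sadness", "Surprise"},
    "CASIA": {"Anger", "Fear", "Happiness", "Neutral", "Sadness", "Surprise"},
}


def shared_emotions(source, target):
    """Emotions present in both corpora, i.e. the classes kept for the task."""
    return sorted(CORPUS_EMOTIONS[source] & CORPUS_EMOTIONS[target])


for pair in [("EmoDB", "eNTERFACE"), ("EmoDB", "CASIA"), ("eNTERFACE", "CASIA")]:
    print(pair, shared_emotions(*pair))
```

Each pair yields five shared emotions, matching the non-empty rows of the corresponding columns in Table I.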
III-A3 Performance Metric
We choose the unweighted average recall (UAR) [10] as the performance metric for our experiments. UAR is computed by averaging the per-class recall over all emotion classes: $\mathrm{UAR} = \frac{1}{C} \sum_{i=1}^{C} \frac{n_i}{N_i}$. Here, $C$ is the total number of emotion classes involved in the cross-corpus SER task, and $n_i$ and $N_i$ represent the number of samples correctly predicted as the $i$-th emotion and the actual number of samples of the $i$-th emotion, respectively.
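A minimal sketch of the metric, assuming labels are given as plain Python lists and that every class of interest appears at least once among the true labels; the function name is hypothetical.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every emotion counts equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Four "A" samples (three correct) and two "B" samples (one correct).
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "B", "B", "A"]
print(unweighted_average_recall(y_true, y_pred))  # (3/4 + 1/2) / 2 = 0.625
```

In this example the plain accuracy is 4/6, whereas UAR is 0.625; the two differ whenever classes are imbalanced, as they are for EmoDB in Table I.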
III-A4 Comparison Methods and Implementation Details
To highlight the effectiveness and superior performance of our AKTLR method in addressing the challenge of cross-corpus SER, we compare it with five recent state-of-the-art (SOTA) Transfer Subspace Learning methods and six SOTA Deep Transfer Learning methods. The methods included in the comparison and their implementation details are as follows:
Transfer Subspace Learning Methods include transfer component analysis (TCA) [40], geodesic flow kernel (GFK) [41], subspace alignment (SA) [42], domain-adaptive subspace learning (DoSL) [43], and joint distribution adaptive regression (JDAR) [44]. In these methods, two widely-used acoustic parameter feature sets, namely INTERSPEECH 2009 Emotion Challenge (IS09) [45] and the extended Geneva minimalistic acoustic parameter set (eGeMAPS) [32], are utilized to describe speech signals. Both feature sets are obtained by applying typical statistical functions to low-level descriptors (LLDs) such as F0 and MFCC. The openSMILE toolkit [46] is used to extract these feature sets from the speech signals. For the experiments, linear support vector machine (SVM) [47] is used as the classifier for all subspace learning methods without classification ability, including TCA, GFK, and SA. Additionally, the results of directly using SVM to conduct all cross-corpus SER experiments are included as the baseline.
Since target emotion label information is unavailable in cross-corpus SER tasks, no labeled target data can be used for model selection. We therefore follow the tradition of transfer learning evaluation and report the best results of the five transfer subspace learning methods by searching their hyper-parameters from a given interval. Specifically, TCA, GFK, and SA aim to learn a -dimensional common subspace for both source and target speech samples, where is set within a predetermined parameter interval, , and represents the number of elements in the acoustic parameter set used in the experiments. DoSL and JDAR require setting two trade-off parameters, and , which control the balance between the sparsity and feature distribution elimination terms and the original regression loss function. In the experiments, and are determined by searching within the range of .
Deep Transfer Learning Methods, including deep adaptation network (DAN) [48], joint adaptation network (JAN) [49], deep subdomain adaptation network (DSAN) [50], domain-adversarial neural network (DANN) [51], conditional domain adversarial network (CDAN) [52], and DIDAN [26], are utilized in the comparison experiments. The speech signals are first transformed into Mel-spectrograms and then resized to pixels, serving as the input for deep neural networks. In this comparison, VGG-11 [53] is chosen as the CNN backbone of all the deep transfer learning methods, and its experimental results are included as the baseline. The optimizer, learning rate, weight decay, and batch size are set as SGD, , , and , respectively, for the VGG-11 and comparison deep transfer learning methods. The trade-off parameter settings for all deep transfer learning methods are as follows:
DAN, JAN, DSAN, DANN, and CDAN have a trade-off parameter in their loss functions, which balances the original loss function and the feature distribution alleviation term. In the experiments, is searched within the parameter interval . Besides , DIDAN has an additional trade-off parameter, , which controls the sparsity of its learned reconstruction coefficient matrix. For DIDAN, and are also searched within the same intervals as the other five deep transfer learning methods: .
Our AKTLR has three trade-off parameters: and . In our experiments, we conduct a search for and in the parameter interval of , while is searched within the range of . Additionally, we divide both IS09 and eGeMAPS feature sets into 10 LLD groups based on the acoustic parameter type. For further details, please refer to Table II.
Feature Set | LLD Groups (number of features)
IS09 | ZCR (12), $\Delta$ZCR (12), F0 (12), $\Delta$F0 (12), RMS Energy (12), $\Delta$RMS Energy (12), HNR (12), $\Delta$HNR (12), MFCC (144), $\Delta$MFCC (144)
eGeMAPS | F0 (18), Loudness (16), Spectral Flux (5), Formant (18), Hammarberg Index (3), MFCC (16), Spectral Slope (6), Alpha Ratio (3), HNR (2), Equivalent Sound Level (1)
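The group sizes in Table II can be turned into index ranges for slicing a feature vector into LLD groups. The sketch below does this for the 10-group eGeMAPS setting and checks that the sizes add up to the 88 eGeMAPS features. It assumes the feature vector has been reordered so that each group occupies a contiguous block, which is not necessarily the order in which openSMILE emits the features; the names `EGEMAPS_GROUPS` and `group_slices` are hypothetical.

```python
# Group names and sizes as given for eGeMAPS in Table II.
EGEMAPS_GROUPS = [
    ("F0", 18), ("Loudness", 16), ("Spectral Flux", 5), ("Formant", 18),
    ("Hammarberg Index", 3), ("MFCC", 16), ("Spectral Slope", 6),
    ("Alpha Ratio", 3), ("HNR", 2), ("Equivalent Sound Level", 1),
]


def group_slices(groups):
    """Map each group name to a slice over a feature vector ordered by group."""
    slices, start = {}, 0
    for name, size in groups:
        slices[name] = slice(start, start + size)
        start += size
    return slices, start


slices, total = group_slices(EGEMAPS_GROUPS)
print(total)            # total number of features covered by the groups
print(slices["MFCC"])   # index range of the MFCC group under this ordering
```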
Category | Method | $B \rightarrow E$ | $E \rightarrow B$ | $B \rightarrow C$ | $C \rightarrow B$ | $E \rightarrow C$ | $C \rightarrow E$ | Average
Subspace Learning (IS09 Feature Set) | SVM | 28.93 | 23.58 | 29.60 | 35.01 | 26.10 | 25.14 | 28.06
Subspace Learning (IS09 Feature Set) | TCA | 30.73 | 45.16 | 33.40 | 45.82 | 31.80 | 34.12 | 36.84
Subspace Learning (IS09 Feature Set) | GFK | 32.40 | 45.42 | 35.60 | 51.19 | 32.90 | 29.54 | 37.84
Subspace Learning (IS09 Feature Set) | SA | 33.50 | 45.78 | 36.90 | 48.48 | 32.80 | 32.71 | 38.36
Subspace Learning (IS09 Feature Set) | DoSL | 36.29 | 39.84 | 34.60 | 46.14 | 30.90 | 31.69 | 36.58
Subspace Learning (IS09 Feature Set) | JDAR | 37.10 | 40.78 | 33.10 | 47.34 | 32.40 | 31.50 | 37.04
Subspace Learning (eGeMAPS Feature Set) | SVM | 25.65 | 32.58 | 33.50 | 51.84 | 36.40 | 34.79 | 35.96
Subspace Learning (eGeMAPS Feature Set) | TCA | 31.09 | 37.43 | 42.90 | 53.43 | 41.10 | 35.90 | 40.31
Subspace Learning (eGeMAPS Feature Set) | GFK | 30.08 | 35.79 | 40.00 | 50.79 | 39.20 | 34.48 | 38.39
Subspace Learning (eGeMAPS Feature Set) | SA | 32.18 | 39.37 | 38.80 | 53.20 | 37.00 | 35.43 | 39.33
Subspace Learning (eGeMAPS Feature Set) | DoSL | 30.81 | 40.71 | 39.30 | 52.21 | 39.10 | 34.27 | 39.40
Subspace Learning (eGeMAPS Feature Set) | JDAR | 31.41 | 45.19 | 42.30 | 56.14 | 38.40 | 33.62 | 41.18
Deep Learning | VGG-11 | 27.08 | 34.83 | 34.80 | 51.31 | 26.90 | 26.02 | 33.49
Deep Learning | DAN | 33.58 | 43.50 | 36.30 | 56.72 | 29.30 | 32.17 | 38.60
Deep Learning | JAN | 35.23 | 47.29 | 37.00 | 57.51 | 31.00 | 32.21 | 40.04
Deep Learning | DSAN | 31.82 | 47.58 | 35.58 | 56.50 | 29.00 | 31.25 | 38.66
Deep Learning | DANN | 32.56 | 46.06 | 36.40 | 57.67 | 30.50 | 33.77 | 39.49
Deep Learning | CDAN | 31.62 | 46.12 | 35.40 | 57.60 | 30.30 | 33.49 | 39.09
Deep Learning | DIDAN | 33.05 | 47.11 | 38.90 | 56.22 | 31.10 | 34.06 | 40.07
Subspace Learning | AKTLR (IS09) | 37.51 | 47.12 | 37.00 | 47.61 | 30.60 | 33.11 | 38.83
Subspace Learning | AKTLR (eGeMAPS) | 32.51 | 43.60 | 45.00 | 59.93 | 37.60 | 34.09 | 42.12
III-B Comparison with State-of-the-Art Cross-Corpus SER Methods
The experimental results for all transfer learning methods are presented in Table III. Several noteworthy observations can be made from this table:
(1) It is evident from Table III that both transfer subspace learning and deep transfer learning methods improve the average UAR over their respective baseline methods (SVM or VGG-11), and do so in nearly every individual cross-corpus SER task. Particularly interesting is that this enhancement is observed in transfer subspace methods regardless of the choice of acoustic parameter feature set (IS09 or eGeMAPS) used to describe speech signals, with only a few exceptions for eGeMAPS. In summary, our experimental results strongly indicate the potential of transfer learning as a promising approach to effectively address the challenge of cross-corpus SER.
(2) The performance comparison of transfer subspace learning methods using the IS09 (16 LLDs yielding 384 features) and eGeMAPS feature sets (five meticulously chosen LLDs yielding 88 features) reveals that the eGeMAPS feature set significantly improves cross-corpus SER performance compared to IS09. This finding underscores the importance of selecting minimalistic high-quality acoustic parameters that generalize well under corpus variance when employing transfer subspace learning methods to address cross-corpus SER tasks. Our results provide additional experimental evidence to support this established knowledge in SER [31, 11, 32], which motivates the design of our AKTLR method.
(3) As shown in the table, our AKTLR, utilizing the eGeMAPS feature set, achieves the highest average UAR among all transfer learning methods, namely 42.12% across the six cross-corpus SER tasks. Furthermore, our AKTLR outperforms all other methods in two out of the six tasks, namely $B \rightarrow C$ and $C \rightarrow B$. While AKTLR does not achieve the best performance in the remaining four tasks, it still demonstrates a very competitive performance compared to all other transfer learning methods. In summary, these observations highlight the superior performance of our AKTLR method in addressing the challenge of cross-corpus SER, surpassing both recent SOTA transfer subspace learning and deep transfer learning methods. This also demonstrates the feasibility and superiority of incorporating acoustic knowledge to develop a domain-specific approach for cross-corpus SER tasks.
#LLD Groups | Details of LLD Groups (number of features)
4 Groups | Frequency (30), Energy (20), Spectral (37), Equivalent Sound Level (1)
13 Groups | F0 (10), Jitter (2), Formant (18), Spectral Slope (6), MFCC (16), Alpha Ratio (3), Shimmer (2), Hammarberg (3), HNR (2), Harmonic Difference (4), Spectral Flux (5), Loudness (16), Equivalent Sound Level (1)
III-C A Deeper Look at the Proposed AKTLR Method
This section aims to provide a comprehensive understanding of the proposed AKTLR method. We will address three key questions to delve into AKTLR: 1) Does AKTLR truly benefit from the incorporation of the selected acoustic knowledge? 2) What can AKTLR learn guided by the selected acoustic knowledge? 3) How does the performance of AKTLR vary with changes in the trade-off parameter? To answer these questions, we will conduct additional cross-corpus SER experiments using AKTLR, with the aim of offering comprehensive insights into its effectiveness and advantages.
Method | $B \rightarrow E$ | $B \rightarrow C$ | $E \rightarrow C$
AKTLR w/o (No Group) | 30.08 | 39.90 | 39.10
AKTLR (4 Groups) | 33.29 | 43.40 | 39.70
AKTLR (10 Groups) | 32.51 | 45.00 | 37.60
AKTLR (13 Groups) | 32.51 | 42.50 | 37.60
III-C1 Does AKTLR Truly Benefit From the Incorporation of the Selected Acoustic Knowledge?
To address this question, we conduct additional experiments on three representative cross-corpus SER tasks: , , and . Specifically, we utilize the eGeMAPS feature set to describe speech signals and divide it under two additional LLD group settings for AKTLR, with 4 and 13 groups, which differ from the setting used in the previous experiments. The detailed configuration of LLD group settings can be found in Table IV. In these experiments, we also remove the regularization term from the objective function of AKTLR, resulting in a reduced version of AKTLR that aligns with the objective function of DoSL [43]. Thus, this reduced version can be viewed as AKTLR without specially considering the different contributions of LLDs, denoted as AKTLR w/o (No Group). The experimental results, presented in Table V, reveal several interesting observations that provide an experimental answer to this question.
Firstly, our AKTLR models, which adopt different LLD group settings, generally achieve better performance in terms of UAR than AKTLR without LLD groups. This observation demonstrates the feasibility and superiority of the concept behind our proposed AKTLR, i.e., “selecting these acoustic parameters may enable the transfer subspace learning models to achieve more promising recognition performance in cross-corpus SER tasks compared to directly using larger feature sets comprising comprehensive acoustic parameters”. Guided by this acoustic knowledge, AKTLR divides the acoustic parameter feature set into different LLD groups and measures their contribution scores, ensuring the learning of both emotion-discriminative and corpus-invariant features.
Secondly, it is worth noting that the AKTLR models with 10 and 13 groups perform worse than AKTLR without LLD groups in the task of . We believe that this is mainly due to the use of an excessive number of LLD groups in these cases. Comparing the group settings across the cross-corpus SER tasks suggests that the overall performance of AKTLR tends to decrease as the number of groups increases, which supports this supposition. Determining a suitable LLD group setting therefore remains an open question for our AKTLR method in tackling the challenge of cross-corpus SER.
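The overall trend can be checked by averaging the UAR values of Table V over the three tasks. The short sketch below does only this arithmetic; the dictionary layout and variable names are illustrative, while the numbers are copied from Table V.

```python
# UAR (%) per configuration on the three tasks, copied from Table V.
table_v = {
    "No Group": [30.08, 39.90, 39.10],
    "4 Groups": [33.29, 43.40, 39.70],
    "10 Groups": [32.51, 45.00, 37.60],
    "13 Groups": [32.51, 42.50, 37.60],
}

# Average UAR over the three tasks for each configuration.
mean_uar = {name: sum(vals) / len(vals) for name, vals in table_v.items()}

for name, value in mean_uar.items():
    print(f"{name}: {value:.2f}")
```

The averages are approximately 36.36 (No Group), 38.80 (4 Groups), 38.37 (10 Groups), and 37.54 (13 Groups). Every grouped setting exceeds the ungrouped baseline on average, and the average decreases as the number of groups grows from 4 to 13.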
III-C2 What Can AKTLR Learn When Guided by the Selected Acoustic Knowledge?
Our proposed AKTLR benefits from the incorporation of established acoustic knowledge into its design. By dividing the acoustic parameter feature set into different LLD groups and measuring their contribution scores, the AKTLR model is better able to seek a minimalistic set of high-quality features that are both emotion-discriminative and corpus-invariant. This motivates us to explore what AKTLR can learn when guided by acoustic knowledge. To this end, we present a set of bar charts in Fig. 2, illustrating the values learned by AKTLR when utilizing the eGeMAPS feature set with different LLD groups for the three representative cross-corpus SER tasks in Table V.
The findings from Fig. 2 are quite intriguing. Firstly, different LLD groups clearly make different contributions when addressing cross-corpus SER tasks. Specifically, in five out of the nine cross-corpus SER experiments, certain acoustic parameters (corresponding to 0-valued ) show negligible contribution in distinguishing emotions across speech corpora. These observations provide experimental evidence that supports the selected acoustic knowledge guiding the design of the proposed AKTLR [31, 11, 32]. This implies that selecting minimalistic high-quality acoustic parameters is necessary and sufficient for dealing with cross-corpus SER tasks.
Secondly, upon further examination of the contributive LLD groups, it becomes apparent that the contributions of several acoustic parameters vary across different cross-corpus SER tasks, exhibiting high scores in some tasks but low scores in others. This suggests that there are no consistently highly-contributive acoustic parameters for all the cross-corpus SER tasks. However, it is interesting to note the presence of several “stable” (varied but consistently contributive) emotion-discriminative and corpus-invariant acoustic parameters, such as MFCC, which consistently exhibit a satisfactory learned score. This insight inspires us to consider the possibility of testing and selecting acoustic parameters to develop a general minimalistic acoustic parameter feature set consisting of high-quality elements that are consistently emotion-discriminative and corpus-invariant. Such a set could potentially enhance the performance of transfer learning methods in addressing the challenge of cross-corpus SER.
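The distinction drawn above between non-contributive, task-dependent, and “stable” LLD groups can be made concrete with a small sketch. It assumes that one non-negative contribution score per group and per task is available. The function name, the tolerance, and the scores below are hypothetical placeholders and are not the values shown in Fig. 2.

```python
def categorize_groups(scores_by_group, tol=1e-8):
    """Label each LLD group by how its learned score behaves across tasks."""
    categories = {}
    for name, scores in scores_by_group.items():
        active = [abs(s) > tol for s in scores]  # True where the score is non-zero
        if all(active):
            categories[name] = "stable"            # contributive in every task
        elif not any(active):
            categories[name] = "non-contributive"  # zero-valued in every task
        else:
            categories[name] = "task-dependent"    # contributive in some tasks only
    return categories


# Hypothetical scores for three tasks (placeholders, NOT results from Fig. 2).
toy_scores = {
    "group_a": [0.30, 0.22, 0.41],
    "group_b": [0.25, 0.00, 0.10],
    "group_c": [0.00, 0.00, 0.00],
}
categories = categorize_groups(toy_scores)
```

Under this reading, a general minimalistic feature set would be built from the groups labeled "stable" across as many cross-corpus tasks as possible.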
III-C3 How Do the Trade-off Parameters Affect the Performance of AKTLR?
In Eq. (4), our AKTLR requires setting three trade-off parameters: , , and . This raises the question of how the choice of these trade-off parameters affects the performance of AKTLR in addressing the challenge of cross-corpus SER. To investigate this point, we continue to conduct experiments using the eGeMAPS feature set on the three cross-corpus SER tasks chosen above: , , and . We change the value of one trade-off parameter while keeping the others fixed, and monitor the experimental results of AKTLR. The intervals for the trade-off parameter values are set as for both and , and for . The fixed values for , , and are those used in the experiments described in Section III-B.
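The one-parameter-at-a-time protocol described above can be sketched as follows. The parameter names `p1`, `p2`, `p3`, the grids, the fixed values, and the function `dummy_evaluate` are all hypothetical stand-ins: in practice, the evaluation function would train AKTLR with the given trade-off parameters and return the UAR on the target corpus.

```python
def one_at_a_time_sweep(evaluate, fixed, grids):
    """Vary one parameter over its grid while the others stay at their fixed values."""
    results = {}
    for name, grid in grids.items():
        curve = []
        for value in grid:
            params = dict(fixed)   # start from the fixed setting
            params[name] = value   # override only the swept parameter
            curve.append((value, evaluate(**params)))
        results[name] = curve
    return results


def dummy_evaluate(p1, p2, p3):
    """Hypothetical stand-in for 'train the model and return UAR (%)'."""
    return 40.0 - abs(p1 - 1.0) - 0.1 * abs(p2 - 1.0) - 0.1 * abs(p3 - 1.0)


fixed = {"p1": 1.0, "p2": 1.0, "p3": 1.0}        # hypothetical fixed values
grids = {name: [0.1, 1.0, 10.0] for name in fixed}  # hypothetical search grids

results = one_at_a_time_sweep(dummy_evaluate, fixed, grids)

# A simple sensitivity summary: the range of the metric along each sweep.
spread = {
    name: max(score for _, score in curve) - min(score for _, score in curve)
    for name, curve in results.items()
}
```

A small spread for a parameter indicates that the method is insensitive to it over the tested grid. Because the other parameters are held fixed, this protocol does not capture interactions between parameters.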
The results are illustrated in Fig. 3. From this figure, it is evident that the performance of our AKTLR varies only slightly with respect to the choice of and across all three cross-corpus SER tasks. In the case of , although the performance of AKTLR appears to be sensitive to changes in its value, AKTLR consistently performs within an acceptable range around the fixed value used in the experiments. In summary, we can conclude that the performance of our AKTLR is relatively insensitive to the choice of its associated trade-off parameters.
IV Conclusion
In this paper, we have addressed the challenge of cross-corpus SER from a new perspective by introducing a novel transfer subspace learning method called AKTLR. The primary contribution of AKTLR lies in its acoustic knowledge-guided dual sparsity constraint mechanism, which enables more effective learning of emotion-discriminative and corpus-invariant features at two different scales: acoustic parameter and feature. Compared with existing transfer subspace learning-based cross-corpus SER methods, AKTLR is the first domain-specific approach designed specifically under the guidance of established acoustic knowledge for cross-corpus SER. To evaluate the effectiveness of AKTLR, we conducted extensive cross-corpus SER experiments using three widely-used speech emotion corpora. The results demonstrate that AKTLR outperforms current SOTA transfer subspace learning and deep transfer learning-based cross-corpus SER methods. This confirms the efficacy and feasibility of leveraging acoustic knowledge to develop domain-specific transfer learning methods for cross-corpus SER.
References
- [1] M. B. Akçay and K. Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Communication, vol. 116, pp. 56–76, 2020.
- [2] Y. B. Singh and S. Goel, “A systematic literature review of speech emotion recognition approaches,” Neurocomputing, vol. 492, pp. 245–263, 2022.
- [3] J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the transformer era in speech emotion recognition: closing the valence gap,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [4] T. L. Nwe, S. W. Foo, and L. C. De Silva, “Speech emotion recognition using hidden markov models,” Speech communication, vol. 41, no. 4, pp. 603–623, 2003.
- [5] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, “Speech emotion recognition using cnn,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 801–804.
- [6] H. M. Fayek, M. Lech, and L. Cavedon, “Evaluating deep learning architectures for speech emotion recognition,” Neural Networks, vol. 92, pp. 60–68, 2017.
- [7] C. Lu, W. Zheng, H. Lian, Y. Zong, C. Tang, S. Li, and Y. Zhao, “Speech emotion recognition via an attentive time–frequency neural network,” IEEE Transactions on Computational Social Systems, 2022.
- [8] C. Lu, Y. Zong, W. Zheng, Y. Li, C. Tang, and B. W. Schuller, “Domain invariant feature learning for speaker-independent speech emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2217–2230, 2022.
- [9] S. Zhang, X. Zhao, and Q. Tian, “Spontaneous speech emotion recognition using multiscale deep convolutional lstm,” IEEE Transactions on Affective Computing, vol. 13, no. 2, pp. 680–688, 2022.
- [10] B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll, “Cross-corpus acoustic emotion recognition: Variances and strategies,” IEEE Transactions on Affective Computing, vol. 1, no. 2, pp. 119–131, 2010.
- [11] C. Parlak, B. Diri, and F. Gürgen, “A cross-corpus experiment in speech emotion recognition.” in SLAM@ INTERSPEECH, 2014, pp. 58–61.
- [12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
- [13] S. Niu, Y. Liu, J. Wang, and H. Song, “A decade survey of transfer learning (2010–2020),” IEEE Transactions on Artificial Intelligence, vol. 1, no. 2, pp. 151–166, 2020.
- [14] A. Hassan, R. Damper, and M. Niranjan, “On acoustic emotion recognition: compensating for covariate shift,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1458–1468, 2013.
- [15] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, pp. 273–297, 1995.
- [16] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, B. Schölkopf et al., “Covariate shift by kernel mean matching,” Dataset shift in machine learning, vol. 3, no. 4, p. 5, 2009.
- [17] T. Kanamori, S. Hido, and M. Sugiyama, “A least-squares approach to direct importance estimation,” The Journal of Machine Learning Research, vol. 10, pp. 1391–1445, 2009.
- [18] Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama, “Direct density ratio estimation for large-scale covariate shift adaptation,” Journal of Information Processing, vol. 17, pp. 138–155, 2009.
- [19] P. Song, W. Zheng, S. Ou, X. Zhang, Y. Jin, J. Liu, and Y. Yu, “Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization,” Speech Communication, vol. 83, pp. 34–41, 2016.
- [20] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
- [21] H. Luo and J. Han, “Nonnegative matrix factorization based transfer subspace learning for cross-corpus speech emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2047–2060, 2020.
- [22] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, and G. Hofer, “Analysis of deep learning architectures for cross-corpus speech emotion recognition.” in Interspeech, 2019, pp. 1656–1660.
- [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
- [24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [25] Y. Zhao, J. Wang, R. Ye, Y. Zong, W. Zheng, and L. Zhao, “Deep transductive transfer regression network for cross-corpus speech emotion recognition,” Proceedings of the INTERSPEECH, Incheon, Korea, pp. 18–22, 2022.
- [26] Y. Zhao, J. Wang, Y. Zong, W. Zheng, H. Lian, and L. Zhao, “Deep implicit distribution alignment networks for cross-corpus speech emotion recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [27] J. Gideon, M. G. McInnis, and E. M. Provost, “Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (addog),” IEEE Transactions on Affective Computing, vol. 12, no. 4, pp. 1055–1068, 2019.
- [28] Y. Gao, S. Okada, L. Wang, J. Liu, and J. Dang, “Domain-invariant feature learning for cross corpus speech emotion recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6427–6431.
- [29] Y. Gao, L. Wang, J. Liu, J. Dang, and S. Okada, “Adversarial domain generalized transformer for cross-corpus speech emotion recognition,” IEEE Transactions on Affective Computing, 2023.
- [30] D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization,” IEEE transactions on evolutionary computation, vol. 1, no. 1, pp. 67–82, 1997.
- [31] C. E. Williams and K. N. Stevens, “Emotions and speech: Some acoustical correlates,” The journal of the acoustical society of America, vol. 52, no. 4B, pp. 1238–1250, 1972.
- [32] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE transactions on affective computing, vol. 7, no. 2, pp. 190–202, 2015.
- [33] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss et al., “A database of german emotional speech.” in Interspeech, vol. 5, 2005, pp. 1517–1520.
- [34] O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The enterface’05 audio-visual emotion database,” in 22nd International Conference on Data Engineering Workshops (ICDEW’06). IEEE, 2006, pp. 8–8.
- [35] J. Zhang and H. Jia, “Design of speech corpus for mandarin text to speech,” in The blizzard challenge 2008 workshop, 2008.
- [36] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 171–184, 2012.
- [37] W. Zheng, “Multi-view facial expression recognition based on group sparse reduced-rank regression,” IEEE Transactions on Affective Computing, vol. 5, no. 1, pp. 71–85, 2014.
- [38] J. Liu, S. Ji, J. Ye et al., “Slep: Sparse learning with efficient projections,” Arizona State University, vol. 6, no. 491, p. 7, 2009.
- [39] Z. Lin, M. Chen, and Y. Ma, “The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices,” arXiv preprint arXiv:1009.5055, 2010.
- [40] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE transactions on neural networks, vol. 22, no. 2, pp. 199–210, 2010.
- [41] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 2066–2073.
- [42] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2960–2967.
- [43] N. Liu, Y. Zong, B. Zhang, L. Liu, J. Chen, G. Zhao, and J. Zhu, “Unsupervised cross-corpus speech emotion recognition using domain-adaptive subspace learning,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5144–5148.
- [44] J. Zhang, L. Jiang, Y. Zong, W. Zheng, and L. Zhao, “Cross-corpus speech emotion recognition using joint distribution adaptive regression,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3790–3794.
- [45] B. Schuller, S. Steidl, and A. Batliner, “The interspeech 2009 emotion challenge,” in Proc. Interspeech 2009, Brighton, UK, 2009, pp. 312–315.
- [46] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462.
- [47] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
- [48] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International conference on machine learning. PMLR, 2015, pp. 97–105.
- [49] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in International conference on machine learning. PMLR, 2017, pp. 2208–2217.
- [50] Y. Zhu, F. Zhuang, J. Wang, G. Ke, J. Chen, J. Bian, H. Xiong, and Q. He, “Deep subdomain adaptation network for image classification,” IEEE transactions on neural networks and learning systems, vol. 32, no. 4, pp. 1713–1722, 2020.
- [51] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand, “Domain-adversarial neural networks,” arXiv preprint arXiv:1412.4446, 2014.
- [52] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” Advances in neural information processing systems, vol. 31, 2018.
- [53] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.