License: arXiv.org perpetual non-exclusive license
arXiv:2312.06466v1 [cs.SD] 11 Dec 2023

Towards Domain-Specific Cross-Corpus Speech Emotion Recognition Approach

Yan Zhao, Yuan Zong*, Hailun Lian, Cheng Lu, Jingang Shi, and Wenming Zheng*

Y. Zhao and H. Lian are with the Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 211189, China, and also with the School of Information Science and Engineering, Southeast University, Nanjing 211189, China. Y. Zong, W. Zheng, and C. Lu are with the Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 211189, China, and also with the School of Biological Science and Medical Engineering, Southeast University, Nanjing 211189, China. J. Shi is with the School of Software, Xi’an Jiao Tong University, Xi’an 710049, China. * Corresponding authors.

Abstract

Cross-corpus speech emotion recognition (SER) poses a challenge due to feature distribution mismatch, potentially degrading the performance of established SER methods. In this paper, we tackle this challenge by proposing a novel transfer subspace learning method called acoustic knowledge-guided transfer linear regression (AKTLR). Unlike existing approaches, which often overlook domain-specific knowledge related to SER and simply treat cross-corpus SER as a generic transfer learning task, our AKTLR method is built upon a well-designed acoustic knowledge-guided dual sparsity constraint mechanism. This mechanism builds on empirically validated acoustic knowledge in SER: minimalistic acoustic parameter feature sets can alleviate classifier over-adaptation and therefore generalize better in cross-corpus SER tasks than large feature sets. Through this mechanism, we extend a simple transfer linear regression model to AKTLR. This extension harnesses its full capability to seek emotion-discriminative and corpus-invariant features from established acoustic parameter feature sets used for describing speech signals across two scales: contributive acoustic parameter groups and constituent elements within each contributive group. Our proposed method is evaluated through extensive cross-corpus SER experiments on three widely-used speech emotion corpora: EmoDB, eNTERFACE, and CASIA. The results confirm the effectiveness and superior performance of our method, outperforming recent state-of-the-art transfer subspace learning and deep transfer learning-based cross-corpus SER methods. Furthermore, our work provides experimental evidence supporting the feasibility and superiority of incorporating domain-specific knowledge into the transfer learning model to address cross-corpus SER tasks.

Index Terms:
Cross-corpus speech emotion recognition, speech emotion recognition, transfer subspace learning, domain adaptation, domain-specific knowledge.

I Introduction

Speech plays a crucial role in human daily communication, serving as a natural means for individuals to express their emotions such as Happiness, Fear, and Sadness. As a result, the research of speech emotion recognition (SER) [1, 2, 3], which seeks to empower computers to automatically understand emotional states from speech signals, holds significant practical value. Over the past few decades, SER has garnered substantial attention within the communities of human-computer interaction, affective computing, and signal processing, leading to the development of numerous well-performing SER methods [4, 5, 6, 7, 8, 9].

However, it is important to note that most established SER methods, including those mentioned above, primarily focus on an ideal scenario where the training and testing speech signals belong to the same speech emotion corpus. In practical situations, the testing speech signals may differ significantly from the training speech signals, exhibiting variations in numerous factors, such as languages, recording equipment, and environmental conditions. This gives rise to a challenging but intriguing task known as cross-corpus SER [10] within the field of SER. In cross-corpus SER tasks, the training and testing speech signals originate from different speech emotion corpora and can be referred to as the source and target signals, respectively. Moreover, while we have access to ground truth emotion labels for the source speech samples, the target speech emotion corpus remains entirely unlabeled.

In the early stages, research on cross-corpus SER mostly focused on feature engineering, aiming to enhance the corpus-invariant ability of acoustic parameter feature sets used to describe speech signals. For example, in the work of [10], three feature normalization schemes, including corpus normalization, speaker normalization, and speaker-corpus normalization, are designed to address feature distribution mismatches between source and target speech emotion corpora. Subsequently, Parlak et al. [11] attempt to use numerous feature selectors, such as linear forward selection, to seek high-quality speech features that are robust to corpus variance from existing comprehensive acoustic feature sets. In recent years, inspired by the tremendous success of transfer learning in various cross-domain recognition tasks [12, 13], researchers have shifted their focus to the development of transfer learning methods for cross-corpus SER. These methods have achieved promising performance in recognizing emotions in speech signals across different corpora, marking a significant advancement in this field.

Broadly, current transfer learning-based cross-corpus SER methods can be classified into two types, transfer subspace learning and deep transfer learning:

(1) Transfer subspace learning-based cross-corpus SER methods typically begin by using a set of acoustic low-level descriptors (LLDs), such as fundamental frequency (F0) and Mel-frequency cepstral coefficients (MFCC), along with their associated functions, such as maximal and mean values, to describe the source and target speech signals. Subsequently, a transfer subspace learning model is developed to mitigate the distribution mismatch between the two feature sets. One early method can be traced back to the work of [14], in which Hassan et al. extend the support vector machine (SVM) [15] to an importance-weighted SVM (IW-SVM) for cross-corpus SER. IW-SVM incorporates three different transfer subspace learning models: kernel mean matching (KMM) [16], unconstrained least-squares importance fitting (uLSIF) [17], and Kullback-Leibler importance estimation procedure (KLIEP) [18]. This enables it to learn a set of weights for the source speech samples, ensuring that the weighted source speech feature sets align with the distribution of target speech feature sets. Another notable work is the transfer non-negative matrix factorization (TNNMF) models designed by Song et al. [19]. These models integrate the maximum mean discrepancy (MMD) [20] to measure and minimize the discrepancies between the source and target speech feature distributions. Following this work, Luo et al. [21] further advance the TNNMF model by jointly reducing the marginal and class-aware conditional feature distribution gaps between the two different speech sample sets.

(2) In contrast to transfer subspace learning, deep transfer learning methods often utilize the speech spectrums of the original speech signals as input for deep neural networks, harnessing their powerful nonlinear representation capabilities to learn emotion-discriminative and corpus-invariant features. Parry et al. [22] examine the generalization capacity of deep neural networks for cross-corpus SER across six different speech emotion corpora. Their experimental results demonstrate that convolutional neural networks (CNNs) [23] exhibit superior generalization capabilities compared to recurrent neural networks (RNNs) [24]. Inspired by this observation, Zhao et al. [25] propose deep transductive transfer regression networks (DTTRN) based on CNN architectures. A key contribution of DTTRN is the incorporation of an additional fine-grained emotion class-aware conditional MMD, which bridges the distribution gap between learned source and target features better than the original MMD. Additionally, Zhao et al. [26] introduce another CNN-based deep transfer learning method called deep implicit distribution alignment neural networks (DIDAN). DIDAN performs implicit distribution alignment for source and target speech corpora by replacing the minimization of MMD with sparsely reconstructing target samples using source samples. More recently, domain-adversarial learning-based models [27, 28, 29] have been developed to learn more generalized representations of speech signals for cross-corpus SER. The key concept behind these methods is the introduction of an additional domain (corpus) classifier, which enables the deep neural networks to learn generalized features to describe speech signals, regardless of their corpus sources.

While both transfer subspace learning and deep transfer learning methods have demonstrated success in addressing the challenge of cross-corpus SER, it is worth noting that these methods often approach cross-corpus SER as a generic transfer learning task. This means that most of these methods focus primarily on developing transfer learning models without specifically considering the valuable acoustic knowledge inherent to SER. As a result, these transfer learning methods can be applied to other cross-domain recognition tasks without making significant modifications. According to the “No Free Lunch Theorem” [30], it is established that “There is no universal learning algorithm that can provide the best solution for every problem. Each algorithm has its strengths and weaknesses, and its performance is highly dependent on the specific problem domain and data distribution.” From this perspective, it can be argued that generic transfer learning methods may not offer ultimately satisfactory solutions for cross-corpus SER. In other words, incorporating domain-specific knowledge from SER to guide the design of transfer learning models could potentially lead to even better performance compared to the generic transfer learning models when dealing with cross-corpus SER. Therefore, our goal in this paper is to develop a domain-specific transfer learning approach for cross-corpus SER. Specifically, we propose a novel transfer subspace learning method called acoustic knowledge-guided transfer linear regression (AKTLR).

Figure 1: Acoustic Knowledge-Guided Dual Sparsity Constraint Mechanism: The Concept behind the Proposed AKTLR Method for Addressing Cross-Corpus SER Tasks.

The basic concept behind AKTLR comes from empirically validated acoustic knowledge about the acoustic parameter feature sets designed for describing speech signals and the evaluation of their cross-corpus recognition performance in SER [31, 11, 32]. These works indicate that small, carefully selected sets of high-quality acoustic parameters generalize better across speech emotion corpora. Therefore, selecting these acoustic parameters may enable transfer subspace learning models to achieve better recognition performance in cross-corpus SER tasks than directly using larger feature sets comprising comprehensive acoustic parameters. This insight motivates us to introduce an acoustic knowledge-guided dual sparsity constraint mechanism, illustrated in Fig. 1, to develop the AKTLR model for cross-corpus SER. As depicted in Fig. 1, this mechanism equips the AKTLR model to discern emotion-discriminative and corpus-invariant features from established acoustic parameter feature sets at both coarse-grained and fine-grained scales. Specifically, it begins by measuring the contribution scores of different acoustic LLDs, and subsequently selects truly contributive derived features from each of the LLD groups with high contribution scores.

To evaluate the effectiveness of AKTLR, we conduct extensive cross-corpus SER experiments using three widely-used speech emotion corpora: EmoDB [33], eNTERFACE [34], and CASIA [35]. The experimental results demonstrate that our AKTLR outperforms recent state-of-the-art transfer subspace learning and deep transfer learning-based cross-corpus SER methods, showcasing the effectiveness of incorporating domain-specific knowledge into transfer subspace learning for cross-corpus SER. In summary, this paper makes three primary contributions:

  1. We propose AKTLR, a novel transfer subspace learning method inspired by empirically verified acoustic knowledge, making it the first work to propose a domain-specific approach for cross-corpus SER.

  2. We introduce an acoustic knowledge-guided dual sparsity constraint mechanism to guide the design of AKTLR. This mechanism enables AKTLR to effectively seek emotion-discriminative and corpus-invariant features from established acoustic parameter feature sets, operating at two different scales, for cross-corpus SER.

  3. We perform extensive cross-corpus SER experiments using three widely-used speech emotion corpora to assess the effectiveness of AKTLR. The experimental results demonstrate the superior performance of AKTLR in addressing the challenge of cross-corpus SER.

The subsequent sections of this paper are structured as follows: Section II provides detailed explanations of the proposed AKTLR method. In Section III, we evaluate the performance of the AKTLR method in tackling the challenge of cross-corpus SER. Finally, the paper is concluded in Section IV.

II Proposed Method

II-A Notations

In this section, we provide a detailed description of the proposed AKTLR model and demonstrate how to use this model to address cross-corpus SER tasks. Before delving into the model specifics, let us establish a set of notations necessary for constructing the model. Suppose we have a source speech emotion corpus comprising $N_s$ samples, with its feature matrix denoted as $\mathbf{X}^s = [{\mathbf{X}_1^s}^T, \cdots, {\mathbf{X}_G^s}^T]^T \in \mathbb{R}^{d \times N_s}$ ($d = \sum_{i=1}^{G} d_i$). Here, $N_s$ is the number of source speech samples and $d$ is the dimension of the acoustic parameter feature vector. $\mathbf{X}_i^s \in \mathbb{R}^{d_i \times N_s}$ represents the features derived from the $i^{th}$ LLD group (one specific LLD or several closely related LLDs), such as MFCC or energy-based features, among the $G$ LLD groups used to design acoustic parameter features for describing speech signals. The corresponding emotion label matrix of the source speech samples is expressed as $\mathbf{Y}^s = [\mathbf{y}_1^s, \cdots, \mathbf{y}_{N_s}^s] \in \mathbb{R}^{C \times N_s}$. Each column $\mathbf{y}_j^s = [y_{j,1}, \cdots, y_{j,C}]^T$ is a one-hot vector associated with the $j^{th}$ speech sample. The $k^{th}$ entry in $\mathbf{y}_j^s$ is set to 1 if the corresponding speech sample expresses the $k^{th}$ emotion within the emotion set $\{1, \cdots, C\}$, and 0 otherwise. Similarly, the target speech feature matrix can be denoted as $\mathbf{X}^t = [{\mathbf{X}_1^t}^T, \cdots, {\mathbf{X}_G^t}^T]^T \in \mathbb{R}^{d \times N_t}$, where $N_t$ is the number of samples in the target speech emotion corpus.
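The notation above can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the helper names `stack_groups` and `one_hot_labels` are hypothetical, the toy dimensions are arbitrary, and emotion labels are indexed from 0 rather than from 1 as in the text.

```python
import numpy as np

def stack_groups(groups):
    """Stack per-LLD-group feature blocks (each d_i x N) into one d x N matrix."""
    n_samples = {g.shape[1] for g in groups}
    if len(n_samples) != 1:
        raise ValueError("all groups must describe the same samples")
    return np.vstack(groups)

def one_hot_labels(labels, num_classes):
    """Build the C x N label matrix; labels are integers in 0..C-1 (assumption)."""
    labels = np.asarray(labels)
    Y = np.zeros((num_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

# Toy example: G = 2 groups with d_1 = 3 and d_2 = 2 features, N_s = 4 samples.
rng = np.random.default_rng(0)
Xs_groups = [rng.normal(size=(3, 4)), rng.normal(size=(2, 4))]
Xs = stack_groups(Xs_groups)            # shape (5, 4), i.e. d x N_s
Ys = one_hot_labels([0, 2, 1, 2], 3)    # shape (3, 4), i.e. C x N_s
```

Keeping the per-group blocks in a list, rather than only the stacked matrix, is convenient here because the model weights each LLD group separately.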

II-B AKTLR Model

As previously described and illustrated in Fig. 1, our AKTLR method is designed based on a simple transfer linear regression model and an acoustic knowledge-guided dual sparsity constraint mechanism. This design enables the model to effectively seek high-quality speech features that are emotion-discriminative and corpus-invariant at two scales within a comprehensive feature set consisting of various acoustic parameters and their derived features. This facilitates the connection of emotions expressed in speech signals from different corpora. To achieve this, we design the following optimization problem for AKTLR:

$$\min_{\mathbf{P},\alpha}\ \mathcal{L}_{tlr} + \mu\mathcal{L}_{ds}, \quad s.t.\ \alpha \succeq 0. \tag{1}$$

In Eq.(1), $\mathcal{L}_{tlr}$ and $\mathcal{L}_{ds}$ represent the loss functions corresponding to a simple Transfer Linear Regression model and the newly designed Acoustic Knowledge-guided Dual Sparsity Constraint Mechanism, respectively. The parameter $\mu$ serves as a trade-off parameter that controls the balance between these two functions. It is important to note that $\mathbf{P} \in \mathbb{R}^{C \times d}$ is the regression coefficient matrix to be learned in AKTLR, and $\alpha = [\alpha_1, \cdots, \alpha_G]^T$ is a contribution score vector and also a model parameter of AKTLR. Each entry in this vector, $\alpha_i$, is a non-negative value and measures the contribution of its corresponding acoustic parameter features, derived from an LLD group, in recognizing emotions across speech corpora. In what follows, we describe the details of the key loss functions in AKTLR.

II-B1 Loss Function for Transfer Linear Regression

The loss function corresponding to transfer linear regression, denoted as tlrsubscript𝑡𝑙𝑟\mathcal{L}_{tlr}caligraphic_L start_POSTSUBSCRIPT italic_t italic_l italic_r end_POSTSUBSCRIPT, can be formulated as follows:

$$\mathcal{L}_{tlr} = \Big\|\mathbf{Y}^s - \sum_{i=1}^{G}\alpha_i\mathbf{P}_i\mathbf{X}_i^s\Big\|_F^2 + \lambda_1\Big\|\sum_{i=1}^{G}\alpha_i\mathbf{P}_i\Delta\bar{\mathbf{x}}_i^{st}\Big\|^2, \tag{2}$$

where $\mathbf{P} = [\mathbf{P}_1, \cdots, \mathbf{P}_G]$ ($\mathbf{P}_i \in \mathbb{R}^{C \times d_i}$), $\Delta\bar{\mathbf{x}}_i^{st} = \frac{1}{N_s}\mathbf{X}_i^s\mathbf{1}_{N_s} - \frac{1}{N_t}\mathbf{X}_i^t\mathbf{1}_{N_t}$ is the mean difference between the source and target speech feature vectors associated with the $i^{th}$ LLD group, $\mathbf{1}_{N}$ denotes the all-ones vector of length $N$, and $\lambda_1$ is the trade-off parameter.

The loss function $\mathcal{L}_{tlr}$ consists of two main terms. The first term, $\|\mathbf{Y}^s - \sum_{i=1}^{G}\alpha_i\mathbf{P}_i\mathbf{X}_i^s\|_F^2$, represents a weighted linear regression function that establishes the relationship between the source speech feature sets and their ground truth emotion labels. Minimizing this term enables the proposed AKTLR to seek a subspace that distinguishes the different emotions expressed in speech signals. The second term, $\|\sum_{i=1}^{G}\alpha_i\mathbf{P}_i\Delta\bar{\mathbf{x}}_i^{st}\|^2$, measures the distribution gap between the source and target speech feature sets in this subspace using the first-order statistical moment, i.e., the mean value. Minimizing this term encourages the source and target feature sets to have similar feature distributions in this subspace. Thus, the proposed AKTLR is also applicable to distinguishing emotions expressed in target speech signals.
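As a rough illustration of how the two terms of Eq.(2) could be evaluated numerically, consider the following NumPy sketch. The function name `tlr_loss` and its argument layout (lists of per-group matrices) are hypothetical choices made for this example, and the mean difference is taken per LLD group, which is an assumption about the intended definition of $\Delta\bar{\mathbf{x}}_i^{st}$.

```python
import numpy as np

def tlr_loss(Ys, Xs_groups, Xt_groups, P_groups, alpha, lam1):
    """Evaluate Eq.(2): weighted regression error plus mean-discrepancy term.

    Ys:        C x N_s one-hot label matrix
    Xs_groups: list of G source blocks, each d_i x N_s
    Xt_groups: list of G target blocks, each d_i x N_t
    P_groups:  list of G coefficient blocks, each C x d_i
    alpha:     length-G contribution scores
    """
    C, Ns = Ys.shape
    pred = np.zeros((C, Ns))   # sum_i alpha_i * P_i @ X_i^s
    gap = np.zeros(C)          # sum_i alpha_i * P_i @ (source mean - target mean)
    for a, P, Xs, Xt in zip(alpha, P_groups, Xs_groups, Xt_groups):
        pred += a * (P @ Xs)
        gap += a * (P @ (Xs.mean(axis=1) - Xt.mean(axis=1)))
    return np.sum((Ys - pred) ** 2) + lam1 * np.sum(gap ** 2)
```

Note that only the per-group means of the unlabeled target features enter the loss, so no target labels are needed to evaluate it.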

II-B2 Loss Function for Acoustic Knowledge-guided Dual Sparsity Constraint Mechanism

The loss function for the acoustic knowledge-guided dual sparsity constraint mechanism is designed as follows:

$$\mathcal{L}_{ds} = \|\alpha\|_1 + \tau\sum_{i=1}^{G}\|\mathbf{P}_i\|_{2,1} \quad (\alpha \succeq 0). \tag{3}$$

Here, $\tau$ is the trade-off parameter. This loss function consists of two major terms: the $l_1$ norm with respect to $\alpha$ and the $l_{2,1}$ norm with respect to $\mathbf{P}_i$. Minimizing this loss function forces the proposed AKTLR to learn a non-negative sparse $\alpha$ and column-sparse $\mathbf{P}_i$. The non-negative sparse $\alpha$ allows the AKTLR model to measure the specific contributions of different acoustic parameter features at the coarse-grained scale of LLD groups, suppressing the less contributive ones while highlighting the highly contributive ones. Additionally, the column-sparse $\mathbf{P}_i$ further enhances AKTLR by performing fine-grained feature selection to suppress low-quality acoustic parameter features derived from LLD groups with high contribution scores.
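The two sparsity terms of Eq.(3) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the $l_{2,1}$ norm is assumed to sum the Euclidean norms of the columns of $\mathbf{P}_i$, which matches the column-sparsity described above (one column per derived feature).

```python
import numpy as np

def l21_norm(P):
    """Sum of the Euclidean norms of the columns of P (assumed convention)."""
    return np.sum(np.linalg.norm(P, axis=0))

def dual_sparsity_loss(alpha, P_groups, tau):
    """Evaluate Eq.(3) for a non-negative contribution score vector alpha."""
    alpha = np.asarray(alpha, dtype=float)
    if np.any(alpha < 0):
        raise ValueError("alpha must be non-negative")
    coarse = np.sum(np.abs(alpha))              # l1 term: selects LLD groups
    fine = sum(l21_norm(P) for P in P_groups)   # l2,1 term: selects features
    return coarse + tau * fine
```

Under this convention, a column of $\mathbf{P}_i$ that is entirely zero contributes nothing to the penalty, and the corresponding feature is dropped from the prediction for every emotion class at once.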

II-B3 Optimization Problem of AKTLR

By incorporating the formulations of the two loss functions as shown in Eqs.(2) and (3) into Eq.(1), we can derive the ultimate optimization problem for training the proposed AKTLR models, which is expressed as follows:

$$\begin{aligned}
\min_{\mathbf{P}_i,\alpha}\ & \Big\|\mathbf{Y}^s - \sum_{i=1}^{G}\alpha_i\mathbf{P}_i\mathbf{X}_i^s\Big\|_F^2 + \lambda_1\Big\|\sum_{i=1}^{G}\alpha_i\mathbf{P}_i\Delta\bar{\mathbf{x}}_i^{st}\Big\|^2 \\
& + \lambda_2\|\alpha\|_1 + \lambda_3\sum_{i=1}^{G}\|\mathbf{P}_i\|_{2,1}, \\
s.t.\ & \alpha \succeq 0.
\end{aligned} \tag{4}$$

where $\lambda_1$, $\lambda_2 = \mu$, and $\lambda_3 = \mu \times \tau$ are the trade-off parameters that control the balance among the key terms in the total loss function of AKTLR.
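For readers who want to see where $\lambda_2$ and $\lambda_3$ come from, the substitution of Eq.(3) into Eq.(1) can be written out explicitly. The following is only a worked expansion of the definitions already given; no new quantities are introduced.

```latex
\begin{aligned}
\mathcal{L}_{tlr} + \mu\,\mathcal{L}_{ds}
&= \mathcal{L}_{tlr} + \mu\Big(\|\alpha\|_1 + \tau\sum_{i=1}^{G}\|\mathbf{P}_i\|_{2,1}\Big) \\
&= \mathcal{L}_{tlr} + \underbrace{\mu}_{\lambda_2}\,\|\alpha\|_1
   + \underbrace{\mu\tau}_{\lambda_3}\,\sum_{i=1}^{G}\|\mathbf{P}_i\|_{2,1}.
\end{aligned}
```

Together with the $\lambda_1$ already present in $\mathcal{L}_{tlr}$, this leaves three trade-off parameters in total.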

Algorithm 1 Updating Procedures for Learning the Optimal $\mathbf{P}$ in Eq.(11).

(1) Fix $\mathbf{P}$, $\mathbf{T}$, $\kappa$, and Minimize $\mathcal{L}$ w.r.t. $\mathbf{Q}$: This step is equivalent to solving the following optimization problem:

$$\min_{\mathbf{Q}}\|\mathbf{Y}-\mathbf{Q}\mathbf{X}\|_F^2+Tr[\mathbf{T}^T(\mathbf{P}-\mathbf{Q})]+\frac{\kappa}{2}\|\mathbf{P}-\mathbf{Q}\|_F^2.$$

Note that this optimization problem has a closed-form solution, which can be expressed as:

$$\mathbf{Q}=(2\mathbf{Y}\mathbf{X}^T+\mathbf{T}+\kappa\mathbf{P})(2\mathbf{X}\mathbf{X}^T+\kappa\mathbf{I})^{-1},$$

where $\mathbf{I}$ is a $d$-by-$d$ identity matrix.

(2) Fix $\mathbf{Q}$, $\mathbf{T}$, $\kappa$, and Minimize $\mathcal{L}$ w.r.t. $\mathbf{P}$: In this step, we are required to solve the following optimization problem:

$$\min_{\mathbf{P}}Tr[\mathbf{T}^T(\mathbf{P}-\mathbf{Q})]+\frac{\kappa}{2}\|\mathbf{P}-\mathbf{Q}\|_F^2+\lambda_3\|\mathbf{P}\|_{2,1},$$

which can be reformulated as follows:

$$\min_{\mathbf{P}}\frac{\lambda_3}{\kappa}\|\mathbf{P}\|_{2,1}+\frac{1}{2}\Big\|\mathbf{P}-\Big(\mathbf{Q}-\frac{\mathbf{T}}{\kappa}\Big)\Big\|_F^2.$$

According to Lemma 4.1 shown in the work of [36], the optimal solution to the above optimization problem is

$$\mathbf{p}_i=\begin{cases}\dfrac{\|\mathbf{q}_i-\mathbf{t}_i/\kappa\|-\lambda_3/\kappa}{\|\mathbf{q}_i-\mathbf{t}_i/\kappa\|}\Big(\mathbf{q}_i-\dfrac{\mathbf{t}_i}{\kappa}\Big), & \text{if } \dfrac{\lambda_3}{\kappa}<\Big\|\mathbf{q}_i-\dfrac{\mathbf{t}_i}{\kappa}\Big\|,\\ \mathbf{0}, & \text{otherwise},\end{cases}$$

where $\mathbf{p}_i$, $\mathbf{q}_i$, and $\mathbf{t}_i$ are the $i^{th}$ column of $\mathbf{P}$, $\mathbf{Q}$, and $\mathbf{T}$, respectively.

(3) Update $\mathbf{T}$ and $\kappa$:

$$\mathbf{T}=\mathbf{T}+\kappa(\mathbf{P}-\mathbf{Q}),\quad \kappa=\min(\rho\kappa,\kappa_{max}),$$

where $\rho>1$ and $\kappa_{max}$ is the preset maximal value for $\kappa$.

(4) Check Convergence:

$$\|\mathbf{P}-\mathbf{Q}\|_{\infty}<\epsilon,$$

where $\epsilon$ is the machine epsilon value.
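The updating procedure of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the update formulas in the comments are obtained here by setting the gradient of the Lagrangian in Eq.(13) to zero and by column-wise soft-thresholding, the stopping rule (largest entry of $|\mathbf{P}-\mathbf{Q}|$ below a tolerance) is an assumption, and the function names and default values (`kappa`, `rho`, `kappa_max`, `eps`, `max_iter`) are hypothetical.

```python
import numpy as np

def column_shrink(M, thresh):
    """Column-wise soft-thresholding, i.e. the proximal operator of thresh * ||.||_{2,1}."""
    norms = np.linalg.norm(M, axis=0)
    # Columns whose norm is below the threshold are set exactly to zero.
    scale = np.maximum(norms - thresh, 0.0) / np.maximum(norms, 1e-12)
    return M * scale

def ialm_l21_regression(Y, X, lam3, kappa=1e-2, rho=1.1, kappa_max=1e6,
                        eps=1e-8, max_iter=1000):
    """Approximately solve min_P ||Y - P X||_F^2 + lam3 * ||P||_{2,1} (Eq.(11)).

    Y : (C, N) regression targets, X : (d, N) features, returns P : (C, d).
    """
    C, d = Y.shape[0], X.shape[0]
    P = np.zeros((C, d))
    T = np.zeros((C, d))          # Lagrangian multiplier matrix
    XXt, YXt, I = X @ X.T, Y @ X.T, np.eye(d)
    for _ in range(max_iter):
        # (1) Q-step: Q (2 X X^T + kappa I) = 2 Y X^T + T + kappa P
        A = 2.0 * XXt + kappa * I
        B = 2.0 * YXt + T + kappa * P
        Q = np.linalg.solve(A, B.T).T      # valid because A is symmetric
        # (2) P-step: shrink the columns of Q - T / kappa by lam3 / kappa
        P = column_shrink(Q - T / kappa, lam3 / kappa)
        # (3) Multiplier and penalty updates
        T = T + kappa * (P - Q)
        kappa = min(rho * kappa, kappa_max)
        # (4) Stop once P and Q agree entry-wise
        if np.max(np.abs(P - Q)) < eps:
            break
    return P
```

Because the shrinkage zeroes whole columns of $\mathbf{P}$, a sufficiently large $\lambda_3$ removes every feature, while smaller values keep only the columns that contribute to the regression.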

II-C Optimization of AKTLR

The optimization problem for training AKTLR, as shown in Eq.(4), can be effectively addressed using the alternating direction method (ADM) [37]. Specifically, the optimal parameters in AKTLR, represented by $\hat{\mathbf{P}}_i$ and $\hat{\alpha}$, can be obtained through the following iterative steps:

(1) Fix $\mathbf{P}_i$ and Update $\alpha$: In this step, the optimization problem becomes one with respect to $\alpha$, which can be formulated as follows:

$$
\begin{aligned}
\min_{\alpha}\ & \Big\|\mathbf{Y}^{s}-\sum_{i=1}^{G}\alpha_i\mathbf{P}_i\mathbf{X}_i^{s}\Big\|_F^2+\lambda_1\Big\|\sum_{i=1}^{G}\alpha_i\mathbf{P}_i\Delta\bar{\mathbf{x}}_i^{st}\Big\|^2+\lambda_2\|\alpha\|_1,\\
\text{s.t.}\ & \alpha\succeq 0.
\end{aligned}
\tag{8}
$$

Let $\mathbf{Y}=[\mathbf{Y}^{s},\mathbf{0}]$ and $\tilde{\mathbf{X}}_i=[\mathbf{X}_i^{s},\sqrt{\lambda_1}\Delta\bar{\mathbf{x}}_i^{st}]$, where $\mathbf{0}\in\mathbb{R}^{C\times 1}$ is a vector of all zero values. Then, the optimization problem in Eq.(8) can be rewritten as:

$$\min_{\alpha}\Big\|\mathbf{Y}-\sum_{i=1}^{G}\alpha_i\mathbf{P}_i\tilde{\mathbf{X}}_i\Big\|_F^2+\lambda_2\|\alpha\|_1. \tag{9}$$

Subsequently, let $\mathbf{z}_i=Flatten(\mathbf{P}_i\tilde{\mathbf{X}}_i)~(i=1,\cdots,G)$ and $\mathbf{y}=Flatten(\mathbf{Y})$, where $Flatten(\cdot)$ is an operation that reshapes a matrix into a vector column by column. We are thus able to further restate the objective function in Eq.(9) as the following formulation:

$$\min_{\alpha}\|\mathbf{y}-\mathbf{Z}\alpha\|^2+\lambda_2\|\alpha\|_1,~\text{s.t.}~\alpha\succeq 0, \tag{10}$$

where $\mathbf{Z}=[\mathbf{z}_1,\cdots,\mathbf{z}_G]$. It is apparent that Eq.(10) represents a standard non-negative LASSO problem, and we utilize the SLEP package [38] to solve it.

(2) Fix $\alpha$ and Update $\mathbf{P}_i$: The optimization problem in this step can be formulated as follows:

$$\min_{\mathbf{P}}\|\mathbf{Y}-\mathbf{P}\mathbf{X}\|_F^2+\lambda_3\|\mathbf{P}\|_{2,1}, \tag{11}$$

where $\mathbf{P}=[\mathbf{P}_1,\cdots,\mathbf{P}_G]$ and $\mathbf{X}=[\mathbf{X}_1^T,\cdots,\mathbf{X}_G^T]^T~(\mathbf{X}_i=\alpha_i\tilde{\mathbf{X}}_i)$. We use the inexact augmented Lagrangian multiplier (IALM) method [39] to learn the optimal $\mathbf{P}_i$.

To be specific, an additional variable $\mathbf{Q}$ satisfying $\mathbf{P}=\mathbf{Q}$ is first introduced to convert the original unconstrained optimization problem in Eq.(11) into a constrained one, which can be expressed as follows:

$$\min_{\mathbf{P},\mathbf{Q}}\|\mathbf{Y}-\mathbf{Q}\mathbf{X}\|_F^2+\lambda_3\|\mathbf{P}\|_{2,1},~\text{s.t.}~\mathbf{P}=\mathbf{Q}. \tag{12}$$

Subsequently, we are able to obtain the Lagrangian function for Eq.(12), which is formulated as follows:

$$\mathcal{L}(\mathbf{P},\mathbf{Q},\mathbf{T},\kappa)=\|\mathbf{Y}-\mathbf{Q}\mathbf{X}\|_F^2+Tr[\mathbf{T}^T(\mathbf{P}-\mathbf{Q})]+\frac{\kappa}{2}\|\mathbf{P}-\mathbf{Q}\|_F^2+\lambda_3\|\mathbf{P}\|_{2,1}, \tag{13}$$

where $\mathbf{T}$ is the Lagrangian multiplier matrix, $Tr(\cdot)$ represents the trace of a square matrix, and $\kappa$ is a relaxation factor.

Finally, the optimal solution of $\mathbf{P}$ can be obtained by iteratively minimizing the Lagrangian function in Eq.(13) with respect to one of the variables while fixing the others. The detailed updating procedures are summarized in Algorithm 1.

(3) Check Convergence: The iteration stops when the value of the objective function is less than the machine epsilon value $\epsilon$ or when the number of iterations reaches the preset maximum.
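As an illustration of step (1), the non-negative LASSO in Eq.(10) can be solved with a simple projected proximal-gradient loop. The sketch below is not the SLEP solver used in the paper; it swaps in a plain ISTA-style iteration, and the helper names (`build_design`, `nonneg_lasso`) as well as the iteration count and tolerance are assumptions made for this example.

```python
import numpy as np

def build_design(P_groups, Xt_groups):
    """Stack z_i = Flatten(P_i X~_i), flattened column by column, as the columns of Z."""
    cols = [(P @ X).flatten(order="F") for P, X in zip(P_groups, Xt_groups)]
    return np.column_stack(cols)

def nonneg_lasso(Z, y, lam2, n_iter=5000, tol=1e-10):
    """Projected ISTA for min_a ||y - Z a||^2 + lam2 * ||a||_1  s.t. a >= 0."""
    a = np.zeros(Z.shape[1])
    # Lipschitz constant of the gradient of ||y - Z a||^2
    L = 2.0 * np.linalg.norm(Z, 2) ** 2
    if L == 0.0:
        return a
    for _ in range(n_iter):
        grad = 2.0 * Z.T @ (Z @ a - y)
        # On the non-negative orthant ||a||_1 = sum(a), so the proximal step is a
        # shifted gradient step followed by clipping at zero.
        a_new = np.maximum(a - (grad + lam2) / L, 0.0)
        converged = np.max(np.abs(a_new - a)) < tol
        a = a_new
        if converged:
            break
    return a
```

With `Z = build_design(P_groups, Xt_groups)` and `y = Y.flatten(order="F")`, the returned vector plays the role of $\alpha$; groups whose entry is exactly zero are dropped from the regression.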

II-D Prediction of Emotion Labels for Target Speech Signals

Once we have obtained the optimal solution, $\hat{\mathbf{P}}_i$ and $\hat{\alpha}$, for AKTLR, we can easily predict the emotion labels of the target speech signals. Let $\mathbf{x}^t=[{\mathbf{x}^t_1}^T,\cdots,{\mathbf{x}^t_G}^T]^T$ be the feature vector of a target speech sample. We first predict its emotion label vector $\hat{\mathbf{y}}^t$ by solving the following optimization problem:

$$\min_{\mathbf{y}^t}\Big\|\mathbf{y}^t-\sum_{i=1}^{G}\hat{\alpha}_i\hat{\mathbf{P}}_i\mathbf{x}_i^t\Big\|_F^2,~\text{s.t.}~\mathbf{y}^t\succeq 0,~\mathbf{1}^T\mathbf{y}^t=1. \tag{14}$$

This is a standard quadratic programming problem and can be effectively solved using the interior point method. Then, based on $\hat{\mathbf{y}}^t$, the emotion label of its corresponding target speech signal can be determined as the $j^{th}$ emotion, which satisfies the following criterion:

$$j=\arg\max_{j}\{\hat{\mathbf{y}}^t(j)~|~j=1,\cdots,C\}, \tag{15}$$

where $\hat{\mathbf{y}}^t(j)$ represents the $j^{th}$ entry in the predicted emotion label vector $\hat{\mathbf{y}}^t$.
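The prediction rule in Eqs.(14) and (15) can be sketched as follows. Since the constraint set in Eq.(14) is the probability simplex, the sketch replaces the interior point method mentioned above with a sort-based Euclidean projection onto the simplex, which solves the same problem; the function names are hypothetical, and the returned class index is 0-based, whereas $j$ in Eq.(15) is 1-based.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {y : y >= 0, sum(y) = 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # entries in descending order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index with a positive gap
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def predict_emotion(x_groups, P_groups, alpha):
    """Return (class index, label vector) for one target sample.

    x_groups : list of G arrays, x_i^t with shape (d_i,)
    P_groups : list of G arrays, learned P_i with shape (C, d_i)
    alpha    : (G,) learned group weights
    """
    # Regression output: sum_i alpha_i P_i x_i^t
    v = sum(a * P @ x for a, P, x in zip(alpha, P_groups, x_groups))
    y_hat = project_to_simplex(v)             # Eq.(14)
    return int(np.argmax(y_hat)), y_hat       # Eq.(15), 0-based index
```

Because the projection subtracts the same offset from every entry before clipping, the largest entry of the regression output stays the largest entry of $\hat{\mathbf{y}}^t$.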

III Experiments

III-A Experiment Setup

In this section, we evaluate the performance of the proposed AKTLR method through extensive cross-corpus SER experiments. We provide details of our experiment setup, including: 1) Speech Emotion Corpora, 2) Experimental Protocol, 3) Performance Metric, and 4) Comparison Methods and Implementation Details.

III-A1 Speech Emotion Corpora

We utilize three publicly available speech emotion corpora in our experiments. Here is a brief overview of these corpora:

EmoDB [33]: This German speech emotion corpus consists of 535 speech samples. Each sample corresponds to a sentence uttered in German under one of seven emotional states (Anger, Boredom, Disgust, Fear, Happiness, Neutral, and Sadness) by one of 10 professional German actresses/actors (five actresses and five actors).

eNTERFACE [34]: Unlike EmoDB, eNTERFACE is a bimodal emotion database containing 1,257 video clips with both speech and facial expressions. Each video clip is labeled with one of six basic emotions (Anger, Disgust, Fear, Happiness, Sadness, and Surprise). For the design of our cross-corpus SER tasks, only the speech data is used.

CASIA [35]: This is a large-scale Chinese speech emotion corpus comprising 9,600 speech samples. In our experiments, we utilize its freely released version, which includes 1,200 speech samples from four speakers (two females and two males), with each speech sample conveying one of six different emotions (Anger, Fear, Happiness, Neutral, Sadness, and Surprise).

III-A2 Experimental Protocol

We used the aforementioned three speech emotion corpora to create six cross-corpus SER tasks: $B\rightarrow E$, $E\rightarrow B$, $B\rightarrow C$, $C\rightarrow B$, $E\rightarrow C$, and $C\rightarrow E$. Here, $B$, $E$, and $C$ represent EmoDB, eNTERFACE, and CASIA, respectively. The corpora listed on the left and right sides of the arrow indicate the source and target speech emotion corpora, respectively, in the corresponding cross-corpus SER task. It is important to note that due to inconsistencies in emotion labels across the three speech emotion corpora, only speech samples with matching emotion labels are chosen for their corresponding tasks. For a more comprehensive understanding of these cross-corpus SER tasks, detailed data composition for all the speech emotion corpora is presented in Table I.

TABLE I: Detailed Sample Composition for All Three Speech Emotion Corpora Used in the Experiments.

| Sample Number | B→E / E→B: EmoDB | B→E / E→B: eNTERFACE | B→C / C→B: EmoDB | B→C / C→B: CASIA | E→C / C→E: eNTERFACE | E→C / C→E: CASIA |
|---|---|---|---|---|---|---|
| Anger | 127 | 211 | 127 | 200 | 211 | 200 |
| Fear | 69 | 211 | 69 | 200 | 211 | 200 |
| Disgust | 46 | 211 | - | - | - | - |
| Happiness | 71 | 208 | 71 | 200 | 208 | 200 |
| Neutral | - | - | 79 | 200 | - | - |
| Sadness | 62 | 211 | 62 | 200 | 211 | 200 |
| Surprise | - | - | - | - | 211 | 200 |
| Total Number | 375 | 1,052 | 408 | 1,000 | 1,052 | 1,000 |
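The label-matching rule behind Table I amounts to a set intersection over the emotion categories of the source and target corpora. A minimal sketch, in which the dictionary and function names are hypothetical and the category sets are copied from the corpus descriptions above:

```python
from itertools import permutations

# Emotion categories of each corpus (B = EmoDB, E = eNTERFACE, C = CASIA).
CORPUS_EMOTIONS = {
    "B": {"Anger", "Boredom", "Disgust", "Fear", "Happiness", "Neutral", "Sadness"},
    "E": {"Anger", "Disgust", "Fear", "Happiness", "Sadness", "Surprise"},
    "C": {"Anger", "Fear", "Happiness", "Neutral", "Sadness", "Surprise"},
}

def shared_emotions(source, target):
    """Emotion labels present in both the source and the target corpus."""
    return sorted(CORPUS_EMOTIONS[source] & CORPUS_EMOTIONS[target])

def build_tasks():
    """Map every ordered (source, target) pair to its shared label set."""
    return {f"{s}->{t}": shared_emotions(s, t)
            for s, t in permutations(CORPUS_EMOTIONS, 2)}
```

Each ordered pair of corpora yields one task, and every task keeps five emotion classes, which matches the rows filled in for each task pair in Table I.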

III-A3 Performance Metric

We have chosen the unweighted average recall (UAR) [10] as the performance metric for our experiments. UAR is computed by averaging the per-class recall over the total number of emotion classes, i.e., $\mathrm{UAR}=\frac{1}{C}\sum_{i=1}^{C}\frac{N_i^{p}}{N_i^{g}}\times 100$. Here, $C$ is the total number of emotion classes involved in the cross-corpus SER task, and $N_i^{p}$ and $N_i^{g}$ represent the number of samples correctly predicted as the $i^{th}$ emotion and the actual number of $i^{th}$ emotion samples, respectively.
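A minimal sketch of the UAR computation, assuming the class set is taken from the ground-truth labels; the function name is hypothetical.

```python
def unweighted_average_recall(y_true, y_pred):
    """UAR in percent: the mean of the per-class recalls."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # N_i^g: number of ground-truth samples of class c
        n_true = sum(1 for t in y_true if t == c)
        # N_i^p: number of class-c samples that are predicted as class c
        n_hit = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(n_hit / n_true)
    return 100.0 * sum(recalls) / len(classes)
```

Unlike plain accuracy, each class contributes equally regardless of its sample count, which matters for imbalanced corpora such as EmoDB (see Table I).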

III-A4 Comparison Methods and Implementation Details

To highlight the effectiveness and superior performance of our AKTLR method in addressing the challenge of cross-corpus SER, we compare it with five recent state-of-the-art (SOTA) Transfer Subspace Learning methods and six SOTA Deep Transfer Learning methods. The methods included in the comparison and their implementation details are as follows:

Transfer Subspace Learning Methods include transfer component analysis (TCA) [40], geodesic flow kernel (GFK) [41], subspace alignment (SA) [42], domain-adaptive subspace learning (DoSL) [43], and joint distribution adaptive regression (JDAR) [44]. In these methods, two widely-used acoustic parameter feature sets, namely INTERSPEECH 2009 Emotion Challenge (IS09) [45] and the extended Geneva minimalistic acoustic parameter set (eGeMAPS) [32], are utilized to describe speech signals. Both feature sets are obtained by applying typical statistical functions to low-level descriptors (LLDs) such as F0 and MFCC. The openSMILE toolkit [46] is used to extract these feature sets from the speech signals. For the experiments, linear support vector machine (SVM) [47] is used as the classifier for all subspace learning methods without classification ability, including TCA, GFK, and SA. Additionally, the results of directly using SVM to conduct all cross-corpus SER experiments are included as the baseline.

Since emotion label information is unavailable in the tasks of cross-corpus SER, we follow the tradition of transfer learning evaluation. Therefore, we report the best results of the five transfer subspace learning methods by searching their hyper-parameters from a given interval. Specifically, TCA, GFK, and SA aim to learn a $d$-dimensional common subspace for both source and target speech samples, where $d$ is set within a predetermined parameter interval, $[1:d_{max}]$, and $d_{max}$ represents the number of elements in the acoustic parameter set used in the experiments. DoSL and JDAR require setting two trade-off parameters, $\lambda$ and $\mu$, which control the balance between the sparsity and feature distribution elimination terms and the original regression loss function. In the experiments, $\lambda$ and $\mu$ are determined by searching within the range of $[1:100]$.

Deep Transfer Learning Methods, including deep adaptation network (DAN) [48], joint adaptation network (JAN) [49], deep subdomain adaptation network (DSAN) [50], domain-adversarial neural network (DANN) [51], conditional domain adversarial network (CDAN) [52], and DIDAN [26], are utilized in the comparison experiments. The speech signals are first transformed into Mel-spectrograms and then resized to $224\times 224$ pixels, serving as the input for deep neural networks. In this comparison, VGG-11 [53] is chosen as the CNN backbone of all the deep transfer learning methods, and its experimental results are included as the baseline. The optimizer, learning rate, weight decay, and batch size are set as SGD, $10^{-2}$, $5\times 10^{-4}$, and $32$, respectively, for the VGG-11 and comparison deep transfer learning methods. The trade-off parameter settings for all deep transfer learning methods are as follows:

DAN, JAN, DSAN, DANN, and CDAN have a trade-off parameter $\lambda$ in their loss functions, which balances the original loss function and the feature distribution alleviation term. In the experiments, $\lambda$ is searched within the parameter interval $[0.0001:0.0001:0.001, 0.002:0.001:0.01, 0.02:0.01:0.1, 0.2:0.1:1, 2, 5, 10, 100]$. Besides $\lambda$, DIDAN has an additional trade-off parameter, $\alpha$, which controls the sparsity of its learned reconstruction coefficient matrix. For DIDAN, $\lambda$ and $\alpha$ are both searched within the same interval as used for the other five deep transfer learning methods.
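For reference, the search interval above can be expanded into an explicit list of candidate values. The sketch assumes the colon notation is MATLAB-style `start:step:stop` with both endpoints included; the helper names are hypothetical.

```python
def arange_inclusive(start, step, stop):
    """Values start, start + step, ..., stop (both ends included), rounded to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + k * step, 10) for k in range(n + 1)]

def deep_tradeoff_grid():
    """Candidate trade-off values for the deep transfer learning methods."""
    grid = []
    for start, step, stop in [(0.0001, 0.0001, 0.001),
                              (0.002, 0.001, 0.01),
                              (0.02, 0.01, 0.1),
                              (0.2, 0.1, 1.0)]:
        grid += arange_inclusive(start, step, stop)
    return grid + [2, 5, 10, 100]
```

Under this reading the grid is roughly logarithmic, with a uniform sweep inside each decade and a few coarse values above 1.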

Our AKTLR has three trade-off parameters: $\lambda_1$, $\lambda_2$, and $\lambda_3$. In our experiments, we conduct a search for $\lambda_1$ and $\lambda_3$ in the parameter interval of $[1:100]$, while $\lambda_2$ is searched within the range of $[0.1:0.1:1]$. Additionally, we divide both IS09 and eGeMAPS feature sets into 10 LLD groups based on the acoustic parameter type. For further details, please refer to Table II.

TABLE II: Configuration of LLD Groups for AKTLR Using IS09 and eGeMAPS Feature Sets to Describe Speech Signals (element numbers are indicated in parentheses).

| Feature Set | LLD Groups |
|---|---|
| IS09 | ZCR (12), $\Delta$ZCR (12), F0 (12), $\Delta$F0 (12), RMS Energy (12), $\Delta$RMS Energy (12), HNR (12), $\Delta$HNR (12), MFCC (144), $\Delta$MFCC (144) |
| eGeMAPS | F0 (18), Loudness (16), Spectral Flux (5), Formant (18), Hammarberg Index (3), MFCC (16), Spectral Slope (6), Alpha Ratio (3), HNR (2), Equivalent Sound Level (1) |

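The group configuration in Table II can be turned into index ranges that split a feature vector into the per-group blocks $\mathbf{X}_i$. The sketch below assumes the features have already been reordered so that each group is contiguous (this is not necessarily the order in which openSMILE emits them); the constant and function names are hypothetical.

```python
# (group name, number of elements), copied from Table II
IS09_GROUPS = [("ZCR", 12), ("Delta ZCR", 12), ("F0", 12), ("Delta F0", 12),
               ("RMS Energy", 12), ("Delta RMS Energy", 12), ("HNR", 12),
               ("Delta HNR", 12), ("MFCC", 144), ("Delta MFCC", 144)]
EGEMAPS_GROUPS = [("F0", 18), ("Loudness", 16), ("Spectral Flux", 5),
                  ("Formant", 18), ("Hammarberg Index", 3), ("MFCC", 16),
                  ("Spectral Slope", 6), ("Alpha Ratio", 3), ("HNR", 2),
                  ("Equivalent Sound Level", 1)]

def group_slices(groups):
    """Return ({group name: slice}, total dimension) for contiguous groups."""
    slices, start = {}, 0
    for name, size in groups:
        slices[name] = slice(start, start + size)
        start += size
    return slices, start
```

The totals recover the known feature dimensions, 384 for IS09 and 88 for eGeMAPS, and a block such as `features[slices["MFCC"]]` selects the rows that belong to one LLD group.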
TABLE III: Comparison of the Proposed AKTLR Method and Recent State-of-the-Art Transfer Learning Methods for Cross-Corpus SER Tasks. The Best Result in Each Task is Highlighted in Bold.

| Category | Method | B→E | E→B | B→C | C→B | E→C | C→E | Average |
|---|---|---|---|---|---|---|---|---|
| Subspace Learning (IS09 Feature Set) | SVM | 28.93 | 23.58 | 29.60 | 35.01 | 26.10 | 25.14 | 28.06 |
| | TCA | 30.73 | 45.16 | 33.40 | 45.82 | 31.80 | 34.12 | 36.84 |
| | GFK | 32.40 | 45.42 | 35.60 | 51.19 | 32.90 | 29.54 | 37.84 |
| | SA | 33.50 | 45.78 | 36.90 | 48.48 | 32.80 | 32.71 | 38.36 |
| | DoSL | 36.29 | 39.84 | 34.60 | 46.14 | 30.90 | 31.69 | 36.58 |
| | JDAR | 37.10 | 40.78 | 33.10 | 47.34 | 32.40 | 31.50 | 37.04 |
| Subspace Learning (eGeMAPS Feature Set) | SVM | 25.65 | 32.58 | 33.50 | 51.84 | 36.40 | 34.79 | 35.96 |
| | TCA | 31.09 | 37.43 | 42.90 | 53.43 | **41.10** | **35.90** | 40.31 |
| | GFK | 30.08 | 35.79 | 40.00 | 50.79 | 39.20 | 34.48 | 38.39 |
| | SA | 32.18 | 39.37 | 38.80 | 53.20 | 37.00 | 35.43 | 39.33 |
| | DoSL | 30.81 | 40.71 | 39.30 | 52.21 | 39.10 | 34.27 | 39.40 |
| | JDAR | 31.41 | 45.19 | 42.30 | 56.14 | 38.40 | 33.62 | 41.18 |
| Deep Learning | VGG-11 | 27.08 | 34.83 | 34.80 | 51.31 | 26.90 | 26.02 | 33.49 |
| | DAN | 33.58 | 43.50 | 36.30 | 56.72 | 29.30 | 32.17 | 38.60 |
| | JAN | 35.23 | 47.29 | 37.00 | 57.51 | 31.00 | 32.21 | 40.04 |
| | DSAN | 31.82 | **47.58** | 35.58 | 56.50 | 29.00 | 31.25 | 38.66 |
| | DANN | 32.56 | 46.06 | 36.40 | 57.67 | 30.50 | 33.77 | 39.49 |
| | CDAN | 31.62 | 46.12 | 35.40 | 57.60 | 30.30 | 33.49 | 39.09 |
| | DIDAN | 33.05 | 47.11 | 38.90 | 56.22 | 31.10 | 34.06 | 40.07 |
| Subspace Learning | AKTLR (IS09) | **37.51** | 47.12 | 37.00 | 47.61 | 30.60 | 33.11 | 38.83 |
| | AKTLR (eGeMAPS) | 32.51 | 43.60 | **45.00** | **59.93** | 37.60 | 34.09 | **42.12** |

III-B Comparison with State-of-the-Art Cross-Corpus SER Methods

The experimental results for all transfer learning methods are presented in Table III. Several noteworthy observations can be made from this table:

(1) It is evident from Table III that both transfer subspace learning and deep transfer learning methods exhibit promising performance improvements compared to their respective baseline methods (SVM or VGG-11) in almost all of the six cross-corpus SER tasks; the only exceptions are a few eGeMAPS-based results on $C\rightarrow B$ and $C\rightarrow E$ (GFK, DoSL, and JDAR) that fall slightly below the SVM baseline. Particularly interesting is that, in terms of average UAR, the enhancement is observed in transfer subspace methods regardless of the choice of acoustic parameter feature sets (IS09 or eGeMAPS) used to describe speech signals. In summary, our experimental results strongly indicate the potential of transfer learning as a promising approach to effectively address the challenge of cross-corpus SER.

(2) The performance comparison of transfer subspace learning methods using the IS09 (16 LLDs yielding 384 features) and eGeMAPS feature sets (five meticulously chosen LLDs yielding 88 features) reveals that the eGeMAPS feature set significantly improves cross-corpus SER performance compared to IS09. This finding underscores the importance of selecting minimalistic high-quality acoustic parameters capable of exhibiting superior generalization ability to corpus invariance when employing transfer subspace learning methods to address cross-corpus SER tasks. Our results provide additional experimental evidence to support this established knowledge in SER [31, 11, 32], which motivates the design of our AKTLR method.

(3) As shown in the table, our AKTLR, utilizing the eGeMAPS feature set, achieves the highest average UAR among all transfer learning methods, namely $42.12\%$ across the six cross-corpus SER tasks. Furthermore, our AKTLR outperforms all other methods in two out of the six tasks, namely $B\rightarrow C$ and $C\rightarrow B$. While AKTLR may not achieve the best performance in the remaining four tasks, it still demonstrates a very competitive performance compared to all other transfer learning methods. In summary, these observations highlight the superior performance of our AKTLR method in addressing the challenge of cross-corpus SER, surpassing both recent SOTA transfer subspace learning and deep transfer learning methods. This also demonstrates the feasibility and superiority of incorporating acoustic knowledge to develop a domain-specific cross-corpus SER approach for dealing with cross-corpus SER tasks.

TABLE IV: Detailed Configuration of Additional LLD Group Settings for eGeMAPS Feature Set. The Number of Elements is Given in Parentheses.

| #LLD Groups | Details of LLD Groups |
|---|---|
| 4 Groups | Frequency (30), Energy (20), Spectral (37), Equivalent Sound Level (1) |
| 13 Groups | F0 (10), Jitter (2), Formant (18), Spectral Slope (6), MFCC (16), Alpha Ratio (3), Shimmer (2), Hammarberg (3), HNR (2), Harmonic Difference (4), Spectral Flux (5), Loudness (16), Equivalent Sound Level (1) |

III-C A Deeper Look at the Proposed AKTLR Method

This section aims to provide a comprehensive understanding of the proposed AKTLR method. We will address three key questions to delve into AKTLR: 1) Does AKTLR truly benefit from the incorporation of the selected acoustic knowledge? 2) What can AKTLR learn guided by the selected acoustic knowledge? 3) How does the performance of AKTLR vary with changes in the trade-off parameters? To answer these questions, we will conduct additional cross-corpus SER experiments using AKTLR, with the aim of offering comprehensive insights into its effectiveness and advantages.

TABLE V: The experimental results (UAR, %) of AKTLR with different LLD group settings using the eGeMAPS feature set on three cross-corpus SER tasks.

| Method | B→E | B→C | E→C |
|---|---|---|---|
| AKTLR w/o $\|\alpha\|_1$ (No Group) | 30.08 | 39.90 | 39.10 |
| AKTLR (4 Groups) | 33.29 | 43.40 | 39.70 |
| AKTLR (10 Groups) | 32.51 | 45.00 | 37.60 |
| AKTLR (13 Groups) | 32.51 | 42.50 | 37.60 |

Figure 2: The bar charts for the learned $\alpha$ by AKTLR, depicting the specific contributions of their corresponding acoustic parameter derived features for cross-corpus SER. (a), (b), and (c) correspond to Tasks $B\rightarrow E$, $B\rightarrow C$, and $E\rightarrow C$.

III-C1 Does AKTLR Truly Benefit From the Incorporation of the Selected Acoustic Knowledge?

To address this question, we conduct additional experiments on three representative cross-corpus SER tasks: $B \rightarrow E$, $B \rightarrow C$, and $E \rightarrow C$. Specifically, we utilize the eGeMAPS feature set to describe speech signals and partition it under two additional LLD group settings for AKTLR, $G=4$ and $G=13$, in contrast to the previous experiments where $G=10$. The detailed configuration of these LLD group settings can be found in Table IV. In these experiments, we also remove the regularization term $\lVert\alpha\rVert_1$ from the objective function of AKTLR, resulting in a reduced version of AKTLR that aligns with the objective function of DoSL [43]. This reduced version can thus be viewed as AKTLR without specifically considering the different contributions of LLDs, denoted as AKTLR w/o $\lVert\alpha\rVert_1$ (No Group). The experimental results, presented in Table V, reveal several interesting observations that provide an experimental answer to this question.

Figure 3: Trade-off parameter sensitivity analysis of the proposed AKTLR on cross-corpus SER tasks, where (a), (b), and (c) show the results of varying $\lambda_1$, $\lambda_2$, and $\lambda_3$, respectively, while fixing the others.

Firstly, our AKTLR models with LLD group settings achieve a higher UAR than AKTLR without LLD groups in seven of the nine comparisons in Table V, including every setting on $B \rightarrow E$ and $B \rightarrow C$. This observation demonstrates the feasibility and superiority of the concept behind our proposed AKTLR, i.e., "selecting these acoustic parameters may enable the transfer subspace learning models to achieve more promising recognition performance in cross-corpus SER tasks compared to directly using larger feature sets comprising comprehensive acoustic parameters". Guided by this acoustic knowledge, AKTLR divides the acoustic parameter feature set into different LLD groups and measures their contribution scores, which supports the learning of features that are both emotion-discriminative and corpus-invariant.

Secondly, it is worth noting that the AKTLR models with 10 and 13 groups perform worse than AKTLR without LLD groups in the task of $E \rightarrow C$. We believe that this is mainly due to the use of an excessive number of LLD groups in these cases. Comparing the group settings across the three cross-corpus SER tasks, the overall (average) performance of AKTLR decreases as the number of groups increases, which supports this supposition. Consequently, determining a suitable LLD group setting remains an open question for our AKTLR method in tackling the challenge of cross-corpus SER.
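The trend described above can be checked directly from the numbers in Table V. The short sketch below averages the UAR of each setting over the three tasks and counts how often a grouped setting beats the no-group baseline; only the variable names are assumptions, and the values are copied from Table V.

```python
# UAR (%) per setting on tasks B->E, B->C, E->C, copied from Table V.
TABLE_V = {
    "No Group": (30.08, 39.90, 39.10),
    "4 Groups": (33.29, 43.40, 39.70),
    "10 Groups": (32.51, 45.00, 37.60),
    "13 Groups": (32.51, 42.50, 37.60),
}

# Average UAR over the three tasks for each setting.
mean_uar = {name: round(sum(v) / len(v), 2) for name, v in TABLE_V.items()}

# Number of tasks on which each grouped setting beats the no-group baseline.
baseline = TABLE_V["No Group"]
wins = {
    name: sum(x > b for x, b in zip(v, baseline))
    for name, v in TABLE_V.items()
    if name != "No Group"
}

print(mean_uar)  # No Group 36.36, 4 Groups 38.8, 10 Groups 38.37, 13 Groups 37.54
print(wins)      # 4 Groups: 3, 10 Groups: 2, 13 Groups: 2
```

All grouped settings average higher than the baseline, while the average falls from 4 to 10 to 13 groups, consistent with the discussion above.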

III-C2 What Can AKTLR Learn When Guided by the Selected Acoustic Knowledge?

Our proposed AKTLR benefits from the incorporation of established acoustic knowledge into its design. By dividing the acoustic parameter feature set into different LLD groups and measuring their contribution scores, the AKTLR model is better able to find a minimalistic set of high-quality features that are both emotion-discriminative and corpus-invariant. This motivates us to explore what AKTLR can learn when guided by acoustic knowledge. To this end, we present a set of bar charts in Fig. 2, illustrating the $\alpha_i$ values learned by AKTLR when utilizing the eGeMAPS feature set with different LLD group settings in the three representative cross-corpus SER tasks of Table V.

The findings from Fig. 2 are quite intriguing. Firstly, it is evident that different LLD groups contribute differently when addressing cross-corpus SER tasks. Specifically, in five out of the nine cross-corpus SER experiments, certain acoustic parameters (corresponding to 0-valued $\alpha_i$) show negligible contribution in distinguishing emotions across speech corpora. These observations provide experimental evidence supporting the selected acoustic knowledge that guides the design of the proposed AKTLR [31, 11, 32]. This implies that selecting minimalistic high-quality acoustic parameters is necessary and sufficient for dealing with the cross-corpus SER tasks.

Secondly, upon further examination of the contributive LLD groups, it becomes apparent that the contributions of several acoustic parameters vary across different cross-corpus SER tasks, with high scores in some tasks and low scores in others. This suggests that there are no consistently highly-contributive acoustic parameters for all the cross-corpus SER tasks. However, it is interesting to note the presence of several "stable" (varied but consistently contributive) emotion-discriminative and corpus-invariant acoustic parameters, such as MFCC, which consistently exhibit a satisfactory learned score. This insight inspires us to consider the possibility of testing and selecting acoustic parameters to develop a general minimalistic acoustic parameter feature set consisting of high-quality elements that are consistently emotion-discriminative and corpus-invariant. Such a set could potentially enhance the performance of transfer learning methods in addressing the challenge of cross-corpus SER.
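The distinction drawn above between 0-valued and contributive $\alpha_i$ can be illustrated with a small sketch. The weights below are hypothetical placeholders chosen for illustration and are not values read from Fig. 2; the group sizes follow the 4-group setting in Table IV, and the helper name and tolerance are assumptions.

```python
# Hypothetical group-level weights for the 4-group setting; NOT taken from Fig. 2.
ALPHA_EXAMPLE = {
    "Frequency": 0.42,
    "Energy": 0.0,
    "Spectral": 0.58,
    "Equivalent Sound Level": 0.0,
}
# Group sizes for the 4-group setting, as listed in Table IV.
GROUP_SIZES = {"Frequency": 30, "Energy": 20, "Spectral": 37, "Equivalent Sound Level": 1}

def split_groups(alpha, tol=1e-8):
    """Separate contributive groups (|alpha_i| > tol) from zeroed-out ones."""
    contributive = [g for g, a in alpha.items() if abs(a) > tol]
    pruned = [g for g, a in alpha.items() if abs(a) <= tol]
    return contributive, pruned

contributive, pruned = split_groups(ALPHA_EXAMPLE)
# Number of features that remain once zero-weighted groups are dropped.
kept_features = sum(GROUP_SIZES[g] for g in contributive)
print(contributive, pruned, kept_features)
```

With these placeholder weights, two groups are pruned and 67 of the 88 features remain, which is the kind of reduction toward a minimalistic feature set that the sparsity on $\alpha$ is intended to encourage.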

III-C3 How Do Trade-off Parameters Affect the Performance of AKTLR?

In Eq. (4), our AKTLR requires setting three trade-off parameters: $\lambda_1$, $\lambda_2$, and $\lambda_3$. This raises the question of how the choice of these trade-off parameters affects the performance of AKTLR in addressing the challenge of cross-corpus SER. To investigate this point, we continue to conduct experiments using the eGeMAPS feature set on the three cross-corpus SER tasks chosen above: $B \rightarrow E$, $B \rightarrow C$, and $E \rightarrow C$. We change the value of one trade-off parameter while keeping the others fixed, and monitor the experimental results of AKTLR. The search intervals for the trade-off parameters are set as $[10:10:100]$ for both $\lambda_1$ and $\lambda_3$, and $[0.1:0.1:1]$ for $\lambda_2$. The fixed values for $\lambda_1$, $\lambda_2$, and $\lambda_3$ are those used in the experiments described in Section III-B.
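The one-at-a-time protocol described above can be sketched as follows. The grids match the stated intervals, but everything else is an assumption for illustration: `evaluate` stands in for training AKTLR and returning UAR, the `FIXED` values are placeholders rather than the settings of Section III-B, and `dummy_evaluate` exists only so the sketch runs.

```python
def grid(start, step, stop):
    """Inclusive range start:step:stop, indexed by integers to limit float drift."""
    n = round((stop - start) / step) + 1
    return [round(start + k * step, 10) for k in range(n)]

LAMBDA_GRIDS = {
    "lambda1": grid(10, 10, 100),
    "lambda2": grid(0.1, 0.1, 1),
    "lambda3": grid(10, 10, 100),
}

def one_at_a_time(evaluate, fixed, grids):
    """Vary one parameter over its grid while the others stay at `fixed`."""
    results = {}
    for name, values in grids.items():
        results[name] = [(v, evaluate(**{**fixed, name: v})) for v in values]
    return results

# Placeholder fixed values (hypothetical; the paper uses those of Section III-B).
FIXED = {"lambda1": 50, "lambda2": 0.5, "lambda3": 50}

def dummy_evaluate(lambda1, lambda2, lambda3):
    """Stand-in for training AKTLR and returning UAR; always returns 0.0."""
    return 0.0

sweep = one_at_a_time(dummy_evaluate, FIXED, LAMBDA_GRIDS)
print({name: len(points) for name, points in sweep.items()})
```

Each parameter is evaluated at 10 values, giving 30 runs per task rather than the 1000 a full grid over all three parameters would need.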

The results are illustrated in Fig. 3. From this figure, it is evident that the performance of our AKTLR varies only slightly with the choice of $\lambda_1$ and $\lambda_2$ across all three cross-corpus SER tasks. In the case of $\lambda_3$, the performance of AKTLR is more sensitive to changes in its value, yet AKTLR still performs within an acceptable range around the fixed value used in the experiments. In summary, we can conclude that the performance of our AKTLR is generally not very sensitive to the choice of its trade-off parameters.

IV Conclusion

In this paper, we have addressed the challenge of cross-corpus SER from a new perspective by introducing a novel transfer subspace learning method called AKTLR. The primary contribution of AKTLR lies in its acoustic knowledge-guided dual sparsity constraint mechanism, which enables more effective learning of emotion-discriminative and corpus-invariant features at two different scales: the acoustic parameter group and the individual features within each group. Compared with existing transfer subspace learning-based cross-corpus SER methods, AKTLR is the first domain-specific approach designed specifically under the guidance of established acoustic knowledge for cross-corpus SER. To evaluate the effectiveness of AKTLR, we conducted extensive cross-corpus SER experiments using three widely-used speech emotion corpora. The results demonstrate that AKTLR outperforms current SOTA transfer subspace learning and deep transfer learning-based cross-corpus SER methods. This confirms the efficacy and feasibility of leveraging acoustic knowledge to develop domain-specific transfer learning methods for cross-corpus SER.

References

  • [1] M. B. Akçay and K. Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Communication, vol. 116, pp. 56–76, 2020.
  • [2] Y. B. Singh and S. Goel, “A systematic literature review of speech emotion recognition approaches,” Neurocomputing, vol. 492, pp. 245–263, 2022.
  • [3] J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller, “Dawn of the transformer era in speech emotion recognition: closing the valence gap,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [4] T. L. Nwe, S. W. Foo, and L. C. De Silva, “Speech emotion recognition using hidden markov models,” Speech communication, vol. 41, no. 4, pp. 603–623, 2003.
  • [5] Z. Huang, M. Dong, Q. Mao, and Y. Zhan, “Speech emotion recognition using cnn,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 801–804.
  • [6] H. M. Fayek, M. Lech, and L. Cavedon, “Evaluating deep learning architectures for speech emotion recognition,” Neural Networks, vol. 92, pp. 60–68, 2017.
  • [7] C. Lu, W. Zheng, H. Lian, Y. Zong, C. Tang, S. Li, and Y. Zhao, “Speech emotion recognition via an attentive time–frequency neural network,” IEEE Transactions on Computational Social Systems, 2022.
  • [8] C. Lu, Y. Zong, W. Zheng, Y. Li, C. Tang, and B. W. Schuller, “Domain invariant feature learning for speaker-independent speech emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2217–2230, 2022.
  • [9] S. Zhang, X. Zhao, and Q. Tian, “Spontaneous speech emotion recognition using multiscale deep convolutional lstm,” IEEE Transactions on Affective Computing, vol. 13, no. 2, pp. 680–688, 2022.
  • [10] B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll, “Cross-corpus acoustic emotion recognition: Variances and strategies,” IEEE Transactions on Affective Computing, vol. 1, no. 2, pp. 119–131, 2010.
  • [11] C. Parlak, B. Diri, and F. Gürgen, “A cross-corpus experiment in speech emotion recognition.” in SLAM@ INTERSPEECH, 2014, pp. 58–61.
  • [12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
  • [13] S. Niu, Y. Liu, J. Wang, and H. Song, “A decade survey of transfer learning (2010–2020),” IEEE Transactions on Artificial Intelligence, vol. 1, no. 2, pp. 151–166, 2020.
  • [14] A. Hassan, R. Damper, and M. Niranjan, “On acoustic emotion recognition: compensating for covariate shift,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1458–1468, 2013.
  • [15] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, pp. 273–297, 1995.
  • [16] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, B. Schölkopf et al., “Covariate shift by kernel mean matching,” Dataset shift in machine learning, vol. 3, no. 4, p. 5, 2009.
  • [17] T. Kanamori, S. Hido, and M. Sugiyama, “A least-squares approach to direct importance estimation,” The Journal of Machine Learning Research, vol. 10, pp. 1391–1445, 2009.
  • [18] Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama, “Direct density ratio estimation for large-scale covariate shift adaptation,” Journal of Information Processing, vol. 17, pp. 138–155, 2009.
  • [19] P. Song, W. Zheng, S. Ou, X. Zhang, Y. Jin, J. Liu, and Y. Yu, “Cross-corpus speech emotion recognition based on transfer non-negative matrix factorization,” Speech Communication, vol. 83, pp. 34–41, 2016.
  • [20] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola, “Integrating structured biological data by kernel maximum mean discrepancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
  • [21] H. Luo and J. Han, “Nonnegative matrix factorization based transfer subspace learning for cross-corpus speech emotion recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2047–2060, 2020.
  • [22] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, and G. Hofer, “Analysis of deep learning architectures for cross-corpus speech emotion recognition.” in Interspeech, 2019, pp. 1656–1660.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [24] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [25] Y. Zhao, J. Wang, R. Ye, Y. Zong, W. Zheng, and L. Zhao, “Deep transductive transfer regression network for cross-corpus speech emotion recognition,” Proceedings of the INTERSPEECH, Incheon, Korea, pp. 18–22, 2022.
  • [26] Y. Zhao, J. Wang, Y. Zong, W. Zheng, H. Lian, and L. Zhao, “Deep implicit distribution alignment networks for cross-corpus speech emotion recognition,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [27] J. Gideon, M. G. McInnis, and E. M. Provost, “Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (addog),” IEEE Transactions on Affective Computing, vol. 12, no. 4, pp. 1055–1068, 2019.
  • [28] Y. Gao, S. Okada, L. Wang, J. Liu, and J. Dang, “Domain-invariant feature learning for cross corpus speech emotion recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 6427–6431.
  • [29] Y. Gao, L. Wang, J. Liu, J. Dang, and S. Okada, “Adversarial domain generalized transformer for cross-corpus speech emotion recognition,” IEEE Transactions on Affective Computing, 2023.
  • [30] D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization,” IEEE transactions on evolutionary computation, vol. 1, no. 1, pp. 67–82, 1997.
  • [31] C. E. Williams and K. N. Stevens, “Emotions and speech: Some acoustical correlates,” The journal of the acoustical society of America, vol. 52, no. 4B, pp. 1238–1250, 1972.
  • [32] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., “The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing,” IEEE transactions on affective computing, vol. 7, no. 2, pp. 190–202, 2015.
  • [33] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, B. Weiss et al., “A database of german emotional speech.” in Interspeech, vol. 5, 2005, pp. 1517–1520.
  • [34] O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The enterface’05 audio-visual emotion database,” in 22nd International Conference on Data Engineering Workshops (ICDEW’06).   IEEE, 2006, pp. 8–8.
  • [35] J. Zhang and H. Jia, “Design of speech corpus for mandarin text to speech,” in The blizzard challenge 2008 workshop, 2008.
  • [36] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 1, pp. 171–184, 2012.
  • [37] W. Zheng, “Multi-view facial expression recognition based on group sparse reduced-rank regression,” IEEE Transactions on Affective Computing, vol. 5, no. 1, pp. 71–85, 2014.
  • [38] J. Liu, S. Ji, J. Ye et al., “Slep: Sparse learning with efficient projections,” Arizona State University, vol. 6, no. 491, p. 7, 2009.
  • [39] Z. Lin, M. Chen, and Y. Ma, “The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices,” arXiv preprint arXiv:1009.5055, 2010.
  • [40] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE transactions on neural networks, vol. 22, no. 2, pp. 199–210, 2010.
  • [41] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in 2012 IEEE conference on computer vision and pattern recognition.   IEEE, 2012, pp. 2066–2073.
  • [42] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars, “Unsupervised visual domain adaptation using subspace alignment,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 2960–2967.
  • [43] N. Liu, Y. Zong, B. Zhang, L. Liu, J. Chen, G. Zhao, and J. Zhu, “Unsupervised cross-corpus speech emotion recognition using domain-adaptive subspace learning,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5144–5148.
  • [44] J. Zhang, L. Jiang, Y. Zong, W. Zheng, and L. Zhao, “Cross-corpus speech emotion recognition using joint distribution adaptive regression,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 3790–3794.
  • [45] B. Schuller, S. Steidl, and A. Batliner, “The interspeech 2009 emotion challenge,” in Proc. Interspeech 2009, Brighton, UK, 2009, pp. 312–315.
  • [46] F. Eyben, M. Wöllmer, and B. Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 1459–1462.
  • [47] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector machines,” ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, pp. 1–27, 2011.
  • [48] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International conference on machine learning.   PMLR, 2015, pp. 97–105.
  • [49] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in International conference on machine learning.   PMLR, 2017, pp. 2208–2217.
  • [50] Y. Zhu, F. Zhuang, J. Wang, G. Ke, J. Chen, J. Bian, H. Xiong, and Q. He, “Deep subdomain adaptation network for image classification,” IEEE transactions on neural networks and learning systems, vol. 32, no. 4, pp. 1713–1722, 2020.
  • [51] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand, “Domain-adversarial neural networks,” arXiv preprint arXiv:1412.4446, 2014.
  • [52] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” Advances in neural information processing systems, vol. 31, 2018.
  • [53] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.