SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Junyi Ao¹, Yuancheng Wang¹, Xiaohai Tian², Dekun Chen¹,
Jun Zhang², Lu Lu², Yuxuan Wang², Haizhou Li¹, Zhizheng Wu¹
¹The Chinese University of Hong Kong, Shenzhen
²Bytedance
Equal contribution. Corresponding author: wuzhizheng@cuhk.edu.cn

Abstract

Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, including speech. Although these models can be adept at recognizing and analyzing speech, they often fall short of generating appropriate responses. We argue that this is due to the lack of principles on task definition and model development, which requires open-source datasets and metrics suitable for model evaluation. To bridge the gap, we present SD-Eval, a benchmark dataset aimed at multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval focuses on paralinguistic and environmental information and includes 7,303 utterances, amounting to 8.76 hours of speech data. The data is aggregated from eight public datasets, representing four perspectives: emotion, accent, age, and background sound. To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct a comprehensive evaluation using objective evaluation methods (e.g. BLEU and ROUGE), subjective evaluations and LLM-based metrics for the generated responses. Models conditioned with paralinguistic and environmental information outperform their counterparts in both objective and subjective measures. Moreover, experiments demonstrate LLM-based metrics show a higher correlation with human evaluation compared to traditional metrics. We open-source SD-Eval at https://github.com/amphionspace/SD-Eval.

1 Introduction

Speech contains rich information and plays a crucial role in human-computer interaction [9, 11, 36]. Besides relying on the content information, speech also conveys paralinguistic and environmental information, which can significantly influence conversations. More specifically, the information carried in speech can be categorized into three classes: content information, environmental information and paralinguistic information, as illustrated in Figure 1(a).

Figure 1(a): Speech carries rich information, including linguistic, paralinguistic and environmental information.

The content information refers to the “choice of words”, representing the explicit meaning and linguistic structure of the speech. Environmental information pertains to the “location of conversation”, capturing factors such as background noise and situational context that can influence the interpretation of the speech. Paralinguistic information, which is further divided into “who says” and “how to say”, includes various non-verbal elements that convey additional meaning. “Who says” involves aspects such as the accent, age, and timbre of the speaker, which can affect the perception and understanding of the speech. “How to say” includes prosody, volume, and rhythm, detailing the vocal nuances that contribute to the expressive quality of the speech. Together, this information highlights the multifaceted nature of spoken dialogue, which extends beyond mere words. Figure 1(b) illustrates how environmental and paralinguistic information, such as emotion, accent and age, impacts responses.

Large Language Models (LLMs) have shown remarkable capabilities as a universal interface for general-purpose assistance [1, 50, 51, 52, 60]. Recently, LLMs have evolved to understand not only text but also multi-modal inputs, such as speech and image [29, 65, 61, 8, 49, 20, 39], which broadens the scope of what LLMs can achieve. The capabilities of LLMs with speech input (Speech LLMs) are primarily designed for the perception of speech and analysis of tasks defined by a text instruction prompt. This enables the model not only to recognize content but also to perceive additional information, allowing it to perform various speech-related tasks such as speech recognition and gender classification. However, due to the lack of principles on task definitions and model development, they usually fail to generate appropriate responses directly with speech input. The development of advanced Speech LLMs requires open-source datasets and metrics suitable for model evaluation from every aspect of the rich information carried in speech.

We present a novel benchmark dataset for multidimensional evaluation of spoken dialogue understanding beyond words, namely SD-Eval. The dataset is to promote the development of more empathetic and intelligent spoken dialogue systems that can generate appropriate responses based on paralinguistic and environmental information. The ultimate goal of SD-Eval is to create a benchmark dataset for speech-to-speech conversation system development. As an initial step, SD-Eval focuses on speech-to-text dialogue. The initial version of SD-Eval consists of four sub-tasks, each focusing on evaluating responses to input utterances with different emotions, accents, ages, and background sounds. These sub-tasks are constructed from eight public datasets containing real-recorded speeches. More specifically, SD-Eval comprises four subsets: test-emo, test-acc, test-age, and test-env for emotion, accent, age and background sound, respectively. It includes 7,303 utterances, totalling 8.76 hours of speech data.

To assess the SD-Eval benchmark dataset, we implement three different models and construct a training set following a similar process as SD-Eval. The training set contains 1,052.72 hours of speech data and 724.4k utterances. We also conduct an empirical study of evaluation metrics using objective evaluation methods (e.g. BLEU and ROUGE), subjective opinion score and LLM-based metrics for the generated responses.

2 Related Work

Spoken Conversation Datasets with Paralinguistic Label

Paralinguistic information is crucial for comprehending speech and generating responses in spoken dialogues. Many speech emotion datasets are constructed under spoken dialogue scenarios, such as IEMOCAP [5], SEMAINE [34], and MELD [43]. However, their primary purpose is to identify emotions in speech. Consequently, the dialogue data from these datasets is relatively less suited for training a spoken dialogue system.

Some recent studies build novel datasets such as E-chat200 [57] and StyleTalk [28], which are designed for spoken dialogue with a focus on emotional information. Nevertheless, the text and speech in these datasets are generated using ChatGPT and text-to-speech (TTS) models. Our dataset is based on a mixture of real recorded and synthesized speech and covers multiple aspects, including accents, emotions, ages, and background sounds.

Spoken Question Answering

The spoken question answering (SQA) task requires the system to answer questions from speech. The past approaches [53, 48] mainly divided this task into two parts through a cascaded model: automatic speech recognition (ASR) and text question answering. Recently, some systems [59, 35] aim to achieve end-to-end spoken question answering.

Datasets in the field of SQA include Spoken SQuAD [26], SCQA [59], HeySQuAD [56], and OpenSAQA [16], among others. These datasets lack annotations of paralinguistic information. StyleTalk [28] provides annotations of speaking styles. Our work focuses more on paralinguistic and environmental information to simulate more realistic dialogue scenarios.

Evaluation Metrics for Open-Ended Generation Tasks

Assessing the quality of text produced by language models or human authors for open-ended generation tasks has always been difficult. Traditional evaluation metrics such as BLEU [41] and ROUGE [27] use n-grams to measure the similarity between model outputs and references. Because these metrics focus on lexical overlap, they are ineffective for open-ended generation problems, and they show a relatively weak correlation with human judgement [37]. Embedding-based metrics, such as BERTScore [62], use word or sentence embeddings to measure semantic similarity to the references.

However, the answers to these tasks are open-ended without standard references, and collecting human preferences can be costly and laborious. Recently, several works [30, 14, 63] have used LLMs to evaluate the responses of chat assistants and report a high correlation with human judgement. In our work, we adapt these LLM-based methods for spoken dialogue generation, with a focus on paralinguistic and environmental information.

3 SD-Eval Benchmark Dataset

3.1 Dataset Construction

SD-Eval is divided into four subsets: test-emo, test-acc, test-age, and test-env. Each subset focuses on a specific aspect: emotion, accent, age, and environment, respectively. The ultimate aim of SD-Eval is to create a benchmark dataset for the evaluation of speech-to-speech conversation systems. As a preliminary step, SD-Eval concentrates on speech-to-text dialogues. We construct SD-Eval through the following steps.

Table 1: Statistics of the SD-Eval benchmark dataset, which includes four types of paralinguistic and environmental information.

Type	# Hours	# Utts	Constructed From	Labels
Emotion (test-emo)	1.11	1,289	RAVDESS [32], MEAD [55], JL Corpus [21]	Sad, Angry, Fear, Disgust, Happy
Accent (test-acc)	5.34	4,310	VCTK [58], Common Voice [3]	England, Scottish, Northern Irish, Welsh, Irish, American, Canadian, Australian, New Zealand
Environment (test-env)	0.74	690	LibriSpeech [40], AudioCaps [23], Synthesised Speech	Driving, Children’s Voice, Sea Beach, Raining or Thundering, Bells, Sports Center, Bus or Subway
Age (test-age)	1.57	1,014	MyST [44], Synthesised Speech	Adult, Child
Summary	8.76	7,303	-	-

Data Collection

As shown in Table 1, we select data from eight public datasets to construct SD-Eval. For the test-emo subset, RAVDESS [32], MEAD [55], and JL Corpus [21] are selected because they contain recordings with the same content but different emotions. For the test-env subset, we choose real recorded speech from the LibriSpeech [40] test-clean subset and add background sounds using audio samples from AudioCaps [23].

Synthetic Data Generation

For test-age and test-env, a portion of the data is synthesized. For test-age, we use an internal zero-shot TTS model, which is trained on Libri-light, to generate speech data from the text in MyST [44] with adult speakers. For each text, we randomly select a sample from the LibriSpeech test-clean subset [40] as the prompt to synthesize the data. For test-env, we first select audio collections corresponding to seven types of environments from AudioCaps [23]. Then, we mix each speech sample in the subset of LibriSpeech test-clean with audio randomly selected from these collections corresponding to each environmental scene. Simultaneously, we utilize GPT-4-turbo [1] to generate dialogue data for these seven scenarios and employ the TTS model to generate speech, forming part of the test-env subset. Details of zero-shot TTS and the prompt are introduced in the Appendix A.2 and A.5, respectively.
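The mixing of clean speech with a background sound can be sketched as follows. This is only an illustration: the paper does not state the mixing level, so the target signal-to-noise ratio `snr_db`, the tiling of short noise clips, and the function name `mix_at_snr` are all assumptions.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add background noise to speech at an assumed target SNR in dB."""
    if len(noise) == 0:
        raise ValueError("noise must be non-empty")
    # Repeat the noise if it is shorter than the speech, then crop to length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        return speech.copy()
    # Choose the scale so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


# Toy signals standing in for a LibriSpeech utterance and an AudioCaps clip.
rng = np.random.default_rng(0)
speech = np.sin(np.linspace(0, 100, 1000))
noise = rng.normal(size=300)
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

A lower `snr_db` makes the background sound more prominent; in practice the value would need to be tuned so that the scene remains audible without masking the speech.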

Label Normalization

Because the label categories vary across datasets, we first normalize the labels of all datasets. Specifically, the labels of test-acc include nine widely used and representative accents: England, Scottish, Irish, Welsh, Northern Irish, American, Canadian, Australian, and New Zealand. For the test-emo subset, we first adopt the basic emotions of Ekman’s emotion model [13] as labels: neutral, surprise, sad, happy, angry, disgust, and fear. We choose Ekman’s emotion model because it is widely used in speech emotion recognition [5, 32, 55], which ensures that each emotion category is well represented with a substantial amount of data.

We then further exclude utterances with neutral and surprise emotions. Neutral implies that the speech does not convey positive or negative feelings, making the response primarily content-dependent. However, our focus is on examining the impact of speech emotion on text responses. Similarly, surprise can be associated with different sentiments, depending on the context [46]. Therefore, we excluded data related to these two emotions. As a result, the test-emo subset includes five types of emotions: sad, happy, angry, disgust, and fear.

For the test-env subset, we select seven representative scenarios in daily life to serve as background sounds, as illustrated in Table 1. For the test-age subset, we focus on evaluating whether the model could generate comprehensible responses appropriate to different age groups. Consequently, the labels are divided into two categories: child and adult.

Data Filtering

We filter the test data from three aspects. Firstly, some utterances in the four subsets are notably ambiguous, potentially due to a lack of contextual information. To address this, we design a prompt and use GPT-4-turbo [1] for automatic filtering, as illustrated in Figure 2. Following this initial filtering, three human annotators further evaluate the remaining utterances using the same criteria as the prompt. Secondly, some utterances within test-env contain incorrect background sounds, possibly due to the multi-class labelling of AudioCaps [23]. These utterances are identified and removed by human annotators. Finally, to strengthen the influence of emotion on the responses, we exclude test-emo utterances whose transcript sentiment and emotion label share the same polarity (both positive or both negative). For this purpose, a pre-trained sentiment classification model (https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student) is employed to predict the sentiment of each utterance.
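The third filtering step, which keeps an utterance only when its transcript sentiment and emotion label do not share the same polarity, can be sketched as below. The polarity assigned to each emotion and the `predict_sentiment` callable are assumptions made for illustration; in practice the callable would wrap the sentiment classifier, and here a toy lookup table stands in for it.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed polarity of each test-emo label (not spelled out in the paper).
EMOTION_POLARITY = {
    "happy": "positive",
    "sad": "negative",
    "angry": "negative",
    "disgust": "negative",
    "fear": "negative",
}


def keep_utterance(transcript: str, emotion: str,
                   predict_sentiment: Callable[[str], str]) -> bool:
    """Keep an utterance unless transcript sentiment and emotion share a polarity."""
    return predict_sentiment(transcript) != EMOTION_POLARITY[emotion]


def filter_test_emo(samples: Iterable[Tuple[str, str]],
                    predict_sentiment: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Apply the polarity filter to (transcript, emotion) pairs."""
    return [(t, e) for t, e in samples if keep_utterance(t, e, predict_sentiment)]


# Toy stand-in for the sentiment classifier (hypothetical outputs).
toy_sentiment = {
    "I love this": "positive",
    "This is awful": "negative",
    "It is Tuesday": "neutral",
}
samples = [
    ("I love this", "happy"),     # both positive: dropped
    ("I love this", "sad"),       # mismatch: kept
    ("This is awful", "angry"),   # both negative: dropped
    ("It is Tuesday", "fear"),    # neutral transcript: kept
]
kept = filter_test_emo(samples, toy_sentiment.get)
```

Under this rule, the surviving utterances are those for which a content-only response and an emotion-aware response are most likely to differ.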

Punctuation Restoration

Traditional metrics, such as BLEU [41], require references as input, so we use ChatGPT [38] to generate responses for each utterance. However, the transcripts of three datasets, MEAD [55], LibriSpeech [40] and the UK-Ireland dataset [12], do not contain punctuation, which may degrade the quality of the generated responses. To address this issue, we employ a punctuation restoration model (https://github.com/notAI-tech/fastPunct) to add punctuation to the transcripts of these datasets.

Response Generation

Finally, we use GPT-4o [39] to generate five diverse responses for each utterance in SD-Eval by considering the content and emotion, accent, age or background sounds of speech signals. For instance, the prompt used to generate responses for utterances related to emotion is presented in Figure 3. All the prompts used to generate responses are included in the Appendix A.3.2.

3.2 Dataset Statistics

The statistics of SD-Eval are presented in Table 1. The SD-Eval dataset comprises a total of 7,303 sentences and 8.76 hours of speech data. It contains three types of paralinguistic information (i.e. emotion, accent, age), and the environment type contains seven categories of environmental sounds. The pie charts in Figure 4 illustrate the data distribution for each category within each test set.
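As a quick consistency check, the per-subset figures in Table 1 can be summed to recover the reported totals. The snippet below uses only the numbers from Table 1; the average utterance duration it derives is a back-of-the-envelope figure not reported in the paper.

```python
# Per-subset statistics copied from Table 1.
SUBSETS = {
    "test-emo": {"hours": 1.11, "utterances": 1289},
    "test-acc": {"hours": 5.34, "utterances": 4310},
    "test-env": {"hours": 0.74, "utterances": 690},
    "test-age": {"hours": 1.57, "utterances": 1014},
}

total_hours = round(sum(s["hours"] for s in SUBSETS.values()), 2)
total_utterances = sum(s["utterances"] for s in SUBSETS.values())

# Derived quantity: mean utterance length in seconds over the whole benchmark.
avg_seconds = total_hours * 3600 / total_utterances

print(total_hours, total_utterances)
```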

4 Benchmark Experiments

4.1 Training Set

To assess the SD-Eval benchmark dataset, we construct a training dataset from eleven open datasets for training models. We follow a procedure similar to that of SD-Eval, with two exceptions. Firstly, we simplify the data filtering process by removing sentences with inadequate and ambiguous labels. Secondly, we generate only one response for each sentence. The details, including data statistics and prompts, are given in Appendices A.1 and A.3.1.

4.2 Models

We implement several baselines trained using the proposed training set, aiming to evaluate their capability of comprehending the content of the speech, as well as recognizing emotions, accents, age, or background sounds. The implementations are detailed as follows.

Cascade LLM

As shown in Figure 5(a), the Cascade LLM consists of an automatic speech recognition (ASR) model to recognize the content, followed by an LLM to generate a response based on the text input. The ASR model is Whisper large-v3 [45], which is trained with a large amount of weakly supervised data for speech recognition and translation. During training, the pre-trained LLM with 7 billion parameters is frozen, while we add a trainable LoRA adaptor [19] to facilitate model finetuning. We use this model as a baseline to evaluate responses if only knowing the content of the speech.

VS-LLM

To understand and perceive content as well as paralinguistic and environmental information directly from speech, we design an end-to-end model named Vanilla Speech LLM (VS-LLM). As shown in Figure 5(b), it consists of a speech encoder, an LLM and an additional adaptor that connects the speech encoder to the LLM. The encoder of Whisper large-v3 [45] is used as the speech encoder, followed by a trainable adaptor that further down-samples the speech representation from the speech encoder. The adaptor comprises two linear layers: the first is followed by a GELU activation function [17], and the second is followed by a two-dimensional average pooling operation for down-sampling. Similar to Cascade LLM, a trainable LoRA adaptor is employed for the pre-trained LLM.
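The adaptor described above can be sketched in NumPy as a forward pass only. Several details are assumptions rather than the authors' implementation: the toy feature dimensions, the pooling factor `pool`, the random weight initialisation, the tanh approximation of GELU, and the choice to pool along the time axis only (equivalent to a two-dimensional average pool with a kernel of size `pool` × 1).

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


class SpeechAdaptor:
    """Linear -> GELU -> Linear -> average pooling over time (illustrative only)."""

    def __init__(self, enc_dim: int, llm_dim: int, pool: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (enc_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)
        self.pool = pool

    def __call__(self, feats: np.ndarray) -> np.ndarray:
        # feats: (frames, enc_dim) output of the speech encoder.
        h = gelu(feats @ self.w1 + self.b1)
        h = h @ self.w2 + self.b2
        # Drop trailing frames that do not fill a whole pooling window.
        groups = h.shape[0] // self.pool
        h = h[: groups * self.pool]
        # Average each window of `pool` consecutive frames.
        return h.reshape(groups, self.pool, h.shape[1]).mean(axis=1)


adaptor = SpeechAdaptor(enc_dim=8, llm_dim=16, pool=4)
out = adaptor(np.ones((10, 8)))  # 10 frames -> 2 pooled frames of width 16
```

The pooled frames would then be fed to the LLM in place of text token embeddings, which shortens the speech sequence by a factor of `pool`.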

LLM (Upper Bound)

To assess the system’s upper bound given the speech transcript, we also provide paralinguistic or environmental information as an additional label to the frozen LLM, which is finetuned with LoRA. The input is a concatenation of the ground-truth transcript and the label. For instance, for an utterance whose transcript is “How are you?” and whose emotion is happy, the input is “How are you?<Emotion:Happy>”.
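The input format can be illustrated with a small helper. The function name `build_upper_bound_input` is hypothetical, and only the `Emotion` tag is confirmed by the example in the text; the tag names for the other subsets are not given here.

```python
def build_upper_bound_input(transcript: str, label_type: str, label: str) -> str:
    """Concatenate a ground-truth transcript with its label tag."""
    return f"{transcript}<{label_type}:{label}>"


example = build_upper_bound_input("How are you?", "Emotion", "Happy")
print(example)  # How are you?<Emotion:Happy>
```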

Qwen-Audio

Besides the above self-implemented models, we further assess the performance of an off-the-shelf Speech LLM, Qwen-Audio [8], on SD-Eval. Since Qwen-Audio requires a text instruction prompt for each input to define the task, we add the instruction “How to respond to the audio?” so that the model generates a text response based on the speech.

4.3 Evaluation Metrics

Objective Evaluation

We propose a reference-free metric that uses an LLM to evaluate responses. Specifically, we design a separate prompt for each subset; all prompts are given in Appendix A.4. The prompts require the LLM to consider (a) the response’s naturalness, coherence, engagingness and groundedness, and (b) whether the response is appropriate and fully considers the emotion, accent, age or background sound of the input speech. The LLM is then asked to directly assign a score on a 1-10 scale (for example, 5) to a single answer. For comparison, we also report n-gram-based metrics, namely ROUGE-L [27], BLEU-4 [41] and METEOR [4], and an embedding-based metric, BERTScore [62]. We use Hugging Face Evaluate for scoring, and the BERT model is roberta-large.
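To make the n-gram-based baselines concrete, the following is a simplified ROUGE-L sketch based on the longest common subsequence (LCS). It is not the scoring code used in the paper: whitespace tokenisation, lower-casing, the balanced F1 form, and taking the best score over multiple references are all assumptions, and library implementations may differ in these details.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 between a candidate response and one reference."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def best_rouge_l(candidate: str, references: List[str]) -> float:
    """Assumed aggregation: best score over the available references."""
    return max(rouge_l_f1(candidate, ref) for ref in references)
```

Because the score depends only on shared word order, two equally appropriate responses with different wording can receive very different scores, which is the weakness that motivates the LLM-based metric.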

Subjective Evaluation

In addition, we conduct a human evaluation on 200 randomly selected utterances, 50 from each of the four subsets. Each utterance yields three samples, corresponding to the utterance-response pairs generated by Cascade LLM, VS-LLM, and LLM (Upper Bound), respectively. Human evaluators are instructed to rate the generated responses, and we ensure that each valid sample is rated by at least three evaluators. Consequently, each subset has no fewer than 120 valid samples.

4.4 Experimental Setup

All models implemented by ourselves are built using xtuner [10]. We optimize the model with AdamW [31] with a learning rate of $2\times 10^{-4}$ . The models are finetuned on 16 A100 GPUs, each with a batch size of 16, for two epochs. For the LoRA adaptor of the LLM, we use a rank of 512 and $\alpha$ of 256. In contrast, for the encoder of the Whisper large-v3 model, the rank is set to 64 and $\alpha$ to 16.
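The hyperparameters above can be collected in a small configuration sketch. This is not an xtuner configuration file; the class and variable names are hypothetical. Two derived quantities are shown: the global batch size, which assumes no gradient accumulation, and the LoRA scaling factor, which follows the standard definition of $\alpha$ divided by the rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRAConfig:
    rank: int
    alpha: int

    @property
    def scaling(self) -> float:
        # Standard LoRA scaling applied to the low-rank update.
        return self.alpha / self.rank


LEARNING_RATE = 2e-4
NUM_GPUS = 16
PER_GPU_BATCH_SIZE = 16
EPOCHS = 2

llm_lora = LoRAConfig(rank=512, alpha=256)
encoder_lora = LoRAConfig(rank=64, alpha=16)

# Assumes one optimizer step per forward/backward pass on every GPU.
global_batch_size = NUM_GPUS * PER_GPU_BATCH_SIZE

print(global_batch_size, llm_lora.scaling, encoder_lora.scaling)  # 256 0.5 0.25
```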

Table 2: Main results of the models on four subsets of SD-Eval. †The scores from human evaluations are calculated based on randomly sampled data as described in Section 4.3.

Model	BLEU-4	ROUGE-L	METEOR	BERTScore	GPT-4o	Human Evaluation†
test-emo / Emotion
Qwen-Audio [8]	3.93	19.02	16.82	86.59	2.24	-
Cascade LLM	5.57	22.58	22.29	87.98	4.69	4.28
VS-LLM	9.35	26.22	28.08	89.46	5.27	6.34
LLM (Upper Bound)	11.89	27.08	29.64	89.70	6.45	7.35
test-acc / Accent
Qwen-Audio [8]	4.52	17.15	17.78	85.59	1.72	-
Cascade LLM	13.45	29.92	32.76	89.55	6.61	5.83
VS-LLM	16.55	32.58	36.13	89.98	7.89	7.51
LLM (Upper Bound)	17.60	33.51	37.81	90.14	8.03	7.92
test-age / Age
Qwen-Audio [8]	7.28	23.09	21.80	86.72	2.50	-
Cascade LLM	15.65	31.76	32.16	90.02	6.93	6.88
VS-LLM	17.45	33.91	33.70	90.49	7.97	7.71
LLM (Upper Bound)	19.65	35.58	36.20	91.00	8.37	8.27
test-env / Environment
Qwen-Audio [8]	2.37	16.83	17.50	85.81	2.14	-
Cascade LLM	4.76	21.46	25.54	88.11	5.35	7.14
VS-LLM	8.16	24.76	26.48	88.95	5.59	7.35
LLM (Upper Bound)	10.47	27.32	31.68	89.55	7.58	8.68

4.5 Main Results

Table 2 shows the main results of all models on SD-Eval. Firstly, across all four test sets, VS-LLM outperforms Cascade LLM on all metrics. This suggests that using speech as a direct input allows VS-LLM to implicitly learn paralinguistic and environmental information. Secondly, VS-LLM performs worse than LLM (Upper Bound). The main reason may be that VS-LLM implicitly acquires content as well as paralinguistic and environmental information directly from speech, whereas LLM (Upper Bound) uses ground-truth transcripts and labels. This indicates that how the input data is processed matters for model performance. A detailed ablation study of the input data is presented in Section 4.6. Finally, although Qwen-Audio achieves good results on many tasks [8], it scores lowest on every metric and every subset of SD-Eval. This suggests a current lack of well-defined tasks and datasets in this area.

4.6 Analysis

Ablation Study of Input Data

We further conduct an ablation study on the input data, as shown in Table 3, comparing several models with different inputs. Model 1 is VS-LLM, a Speech LLM that receives no text input. Model 4, which takes transcripts from the ASR model as input, is Cascade LLM. Model 8, which uses ground-truth transcripts and labels, is LLM (Upper Bound). The ASR and speech emotion recognition (SER) models are Whisper large-v3 [45] and emotion2vec [33] (https://huggingface.co/emotion2vec/emotion2vec_plus_seed), respectively.

Table 3: Ablation study on test-emo subset. The model types include LLM (text input only) and Speech LLM (text and speech inputs). “Trans” refers to the method used to obtain the transcripts. Options include “ASR” (generated by an ASR model) and “GT” (ground-truth transcript). “Emotion Label” indicates the source of the speech emotion label for the utterance, either “SER” (produced by a speech emotion recognition model) or “GT” (ground-truth label). “N/A” means the input is not used for the model.

Index	Model Type	Trans	Emotion Label	BLEU-4	ROUGE-L	METEOR	BERTScore	GPT-4o
1	Speech LLM	N/A	N/A	9.35	26.22	28.08	89.46	5.27
2		N/A	SER	9.34	26.35	28.92	89.64	5.99
3		N/A	GT	9.55	26.53	29.01	89.64	6.26
4	LLM	ASR	N/A	5.57	22.58	22.29	87.98	4.69
5		ASR	SER	10.91	26.47	29.19	89.59	5.97
6		GT	SER	11.32	26.78	29.53	89.67	6.25
7		ASR	GT	11.42	26.73	29.20	89.61	6.35
8		GT	GT	11.89	27.08	29.64	89.70	6.45

Firstly, we examine the effect of content quality. Models that use ASR-generated transcripts (Model 5 and Model 7) are inferior across all metrics to their counterparts that use ground-truth transcripts (Model 6 and Model 8). Next, we examine the effect of emotion label quality. For the LLM-based systems, models using emotion labels from the SER model (Model 5 and Model 6) perform worse across all metrics than those using ground-truth labels (Model 7 and Model 8). Model 4, which is trained without emotion labels, performs the worst. A similar trend is observed for the Speech LLM models: Model 2, which obtains emotion labels from the SER model, outperforms Model 1 on every metric except BLEU-4 (9.34 vs. 9.35), while Model 3, trained with ground-truth labels, achieves the best performance among the three. This corroborates our hypothesis in Section 4.5.

Correlations between Objective Metrics and Human Evaluation

Finally, we investigate the correlations between the scores of objective metrics and human evaluation, as shown in Table 4. Following the configuration of GPTScore [14], we use dataset-level Spearman and Kendall-Tau correlations. The results indicate that GPT-4o [39] correlates with human evaluation considerably more strongly than the other metrics, with an average Spearman correlation of 0.617 compared with at most 0.296 for the traditional metrics. These findings support the effectiveness of LLMs as evaluation metrics.

Table 4: Spearman ($\rho$) and Kendall-Tau ($\tau$) correlations between human evaluation and different metrics on the four subsets of SD-Eval.

Metrics	test-emo		test-acc		test-age		test-env		Average
Metrics	$\rho$	$\tau$	$\rho$	$\tau$	$\rho$	$\tau$	$\rho$	$\tau$	$\rho$	$\tau$
BLEU-4	0.179	0.137	0.202	0.157	0.288	0.211	0.028	0.023	0.186	0.143
ROUGE-L	0.220	0.152	0.173	0.122	0.317	0.213	0.037	0.022	0.199	0.134
METEOR	0.373	0.246	0.217	0.149	0.299	0.209	0.247	0.165	0.296	0.200
BERTScore	0.258	0.173	-0.044	-0.029	0.407	0.284	0.217	0.141	0.200	0.134
GPT-4o	0.670	0.527	0.651	0.484	0.480	0.354	0.666	0.529	0.617	0.463
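For reference, the two rank correlations reported in Table 4 can be computed as sketched below. This is a plain-Python illustration, not the evaluation script used in the paper: ties are handled with average ranks for Spearman and with the tau-b variant for Kendall, which is an assumption, and the scores in the usage example are made up.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b over all pairs of samples."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / sqrt((pairs + only_x_tied) * (pairs + only_y_tied))


# Made-up metric and human scores for four responses.
metric_scores = [1, 2, 3, 4]
human_scores = [1, 3, 2, 4]
rho = spearman(metric_scores, human_scores)
tau = kendall_tau_b(metric_scores, human_scores)
```

In a dataset-level setting, the two lists would hold the metric score and the mean human rating for every evaluated sample in a subset.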

5 Conclusion

In this paper, we introduce SD-Eval, a benchmark dataset designed for the multidimensional evaluation of spoken dialogue understanding and generation. SD-Eval includes 7,303 utterances amounting to 8.76 hours of speech data, aggregated from eight public datasets, and focuses on paralinguistic and environmental information across four perspectives: emotion, accent, age, and background sound. The dataset aims to advance the creation of more empathetic and intelligent spoken dialogue systems capable of generating appropriate responses by considering paralinguistic and environmental information. Our comprehensive evaluation demonstrates that models conditioned with paralinguistic or environmental information outperform their counterparts in both objective evaluation and subjective evaluation. Furthermore, our experiments indicate that LLM-based metrics have a higher correlation with human evaluation compared to traditional metrics.

6 Limitations and Future Work

The limitations and future work for SD-Eval are as follows. Firstly, SD-Eval accommodates only speech-to-text dialogues, which limits the evaluation of system responses to the text level. Secondly, SD-Eval currently supports the evaluation of single-turn dialogues only, which limits its application to more complex, multi-turn interactions. Finally, SD-Eval includes four sub-tasks that focus on emotion, accent, age, and environmental information, but it does not yet account for other aspects, such as the gender of the speaker. Addressing these aspects constitutes our future work, with the ultimate goal of developing a benchmark dataset capable of multidimensional evaluation for multi-turn speech-to-speech dialogues.

References

Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Adigwe et al. [2018] Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, and Thierry Dutoit. The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514, 2018.
Ardila et al. [2020] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4215, 2020.
Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https://aclanthology.org/W05-0909.
Busso et al. [2008] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42:335–359, 2008.
Cao et al. [2014] Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4):377–390, 2014. doi: 10.1109/TAFFC.2014.2336244.
Chen et al. [2022] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
Chu et al. [2023] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023.
Cohen and Oviatt [1995] P R Cohen and S L Oviatt. The role of voice input for human-machine communication. Proceedings of the National Academy of Sciences, 92(22):9921–9927, 1995. doi: 10.1073/pnas.92.22.9921. URL https://www.pnas.org/doi/abs/10.1073/pnas.92.22.9921.
Contributors [2023] XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023.
Cowie et al. [2001] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32–80, 2001. doi: 10.1109/79.911197.
Demirsahin et al. [2020] Isin Demirsahin, Oddur Kjartansson, Alexander Gutkin, and Clara Rivera. Open-source Multi-speaker Corpora of the English Accents in the British Isles. In Proceedings of The 12th Language Resources and Evaluation Conference (LREC), pages 6532–6541, Marseille, France, May 2020. European Language Resources Association (ELRA). ISBN 979-10-95546-34-4. URL https://www.aclweb.org/anthology/2020.lrec-1.804.
Ekman and Friesen [1971] Paul Ekman and Wallace V Friesen. Constants across cultures in the face and emotion. Journal of personality and social psychology, 17(2):124, 1971.
Fu et al. [2023] Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023.
Gong et al. [2023a] Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass. Whisper-at: Noise-robust automatic speech recognizers are also strong general audio event taggers. arXiv preprint arXiv:2307.03183, 2023a.
Gong et al. [2023b] Yuan Gong, Alexander H Liu, Hongyin Luo, Leonid Karlinsky, and James Glass. Joint audio and speech understanding. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023b.
Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Hu et al. [2024] Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, et al. Wavllm: Towards robust and adaptive speech large language model. arXiv preprint arXiv:2404.00656, 2024.
James et al. [2018] Jesin James, Li Tian, and Catherine Inez Watson. An Open Source Emotional Speech Corpus for Human Robot Interaction Applications. In Proc. Interspeech 2018, pages 2768–2772, 2018. doi: 10.21437/Interspeech.2018-1349.
Ju et al. [2024] Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, et al. Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024.
Kim et al. [2019] Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, 2019.
Kim et al. [2024] Jaehyeon Kim, Keon Lee, Seungjun Chung, and Jaewoong Cho. Clam-tts: Improving neural codec language model for zero-shot text-to-speech. arXiv preprint arXiv:2404.02781, 2024.
Łajszczak et al. [2024] Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, et al. Base tts: Lessons from building a billion-parameter text-to-speech model on 100k hours of data. arXiv preprint arXiv:2402.08093, 2024.
Li et al. [2018] Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, and Hung-yi Lee. Spoken squad: A study of mitigating the impact of speech recognition errors on listening comprehension. arXiv preprint arXiv:1804.00320, 2018.
Lin [2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
Lin et al. [2024] Guan-Ting Lin, Cheng-Han Chiang, and Hung-yi Lee. Advancing large language models to capture varied speaking styles and respond properly in spoken conversations. arXiv preprint arXiv:2402.12786, 2024.
Liu et al. [2024] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
Liu et al. [2023] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153.
Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Lotfian and Busso [2019] Reza Lotfian and Carlos Busso. Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing, 10(4):471–483, 2019. doi: 10.1109/TAFFC.2017.2736999.
Ma et al. [2023] Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, and Xie Chen. emotion2vec: Self-supervised pre-training for speech emotion representation. arXiv preprint arXiv:2312.15185, 2023.
McKeown et al. [2010] Gary McKeown, Michel F Valstar, Roderick Cowie, and Maja Pantic. The semaine corpus of emotionally coloured character interactions. In 2010 IEEE international conference on multimedia and expo, pages 1079–1084. IEEE, 2010.
Nachmani et al. [2023] Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered llm. In The Twelfth International Conference on Learning Representations, 2023.
Nass and Brave [2005] Clifford Nass and Scott Brave. Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. The MIT Press, 2005. ISBN 0262140926.
Novikova et al. [2017] Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. Why we need new evaluation metrics for nlg. arXiv preprint arXiv:1707.06875, 2017.
OpenAI [2022] OpenAI. Chatgpt, 2022. https://openai.com/blog/chatgpt/.
OpenAI [2024] OpenAI. Gpt-4o, 2024. https://openai.com/index/hello-gpt-4o/.
Panayotov et al. [2015] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964.
Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040.
Peng et al. [2024] Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, and David Harwath. Voicecraft: Zero-shot speech editing and text-to-speech in the wild. arXiv preprint arXiv:2403.16973, 2024.
Poria et al. [2019] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1050. URL https://aclanthology.org/P19-1050.
Pradhan et al. [2024] Sameer Pradhan, Ronald A. Cole, and Wayne H. Ward. My science tutor (MyST)–a large corpus of children’s conversational speech. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12040–12045, Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.1052.
Radford et al. [2023] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
Sailunaz and Alhajj [2019] Kashfia Sailunaz and Reda Alhajj. Emotion and sentiment analysis from twitter text. Journal of computational science, 36:101003, 2019.
Sanchez et al. [2023] Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, and Stella Biderman. Stay on topic with classifier-free guidance. arXiv preprint arXiv:2306.17806, 2023.
Su and Fung [2020] Dan Su and Pascale Fung. Improving spoken question answering using contextualized word representation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8004–8008. IEEE, 2020.
Tang et al. [2024] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun MA, and Chao Zhang. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=14rn7HpKVk.
Team et al. [2023] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Tseng et al. [2016] Bo-Hsiang Tseng, Sheng-Syun Shen, Hung-Yi Lee, and Lin-Shan Lee. Towards machine comprehension of spoken content: Initial toefl listening comprehension test by machine. arXiv preprint arXiv:1608.06378, 2016.
Wang et al. [2023] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
Wang et al. [2020] Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In ECCV, 2020.
Wu et al. [2023] Yijing Wu, SaiKrishna Rallabandi, Ravisutha Srinivasamurthy, Parag Pravin Dakle, Alolika Gon, and Preethi Raghavan. Heysquad: A spoken question answering dataset. arXiv preprint arXiv:2304.13689, 2023.
Xue et al. [2023] Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Qian Chen, and Lei Xie. E-chat: Emotion-sensitive spoken dialogue system with large language models. arXiv preprint arXiv:2401.00475, 2023.
Yamagishi et al. [2019] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit, 2019.
You et al. [2022] Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, and Yuexian Zou. End-to-end spoken conversational question answering: Task, dataset and model. arXiv preprint arXiv:2204.14272, 2022.
Zhang et al. [2023a] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.1055. URL https://aclanthology.org/2023.findings-emnlp.1055.
Zhang et al. [2023b] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023b.
Zhang et al. [2020] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.
Zheng et al. [2024] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
Zhou et al. [2022] Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li. Emotional voice conversion: Theory, databases and esd. Speech Communication, 137:1–18, 2022. ISSN 0167-6393. doi: 10.1016/j.specom.2021.11.006. URL https://www.sciencedirect.com/science/article/pii/S0167639321001308.
Zhu et al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Appendix A

A.1 Statistics of Training Set

Table 5 shows the statistics of the training set. We generate one response for each sentence, except for data related to the environment, for which we generate five different responses per sentence as data augmentation.

Table 5: Statistics of the training set. ChatGPT Version refers to the specific version of ChatGPT used to generate the data.

Type	# Hours	# Utts	Constructed From	Labels	ChatGPT Version
Emotion	120.60	100.5k	MSP-Podcast [32], IEMOCAP [5], MELD [43], EmoV-DB [2], ESD [64], CREMA-D [6]	Angry, Contempt, Disgust, Fear, Happy, Neutral, Sad, Surprise, Frustrated, Excited, Amused, Sleepiness	GPT-3.5-Turbo
Accent	759.75	508.6k	UK-Ireland dataset [12], VCTK [58], Common Voice [3]	England, Scottish, Northern Irish, Welsh, Irish, American, Canadian, Australian, New Zealand	GPT-4o
Environment	32.06	47.1k	LibriSpeech [40], AudioCaps [23], Synthesised Speech	Driving, Children’s Voice, Sea Beach, Raining or Thundering, Bells, Sports Center, Shopping Center, Bus or Subway	GPT-4-Turbo
Age	140.31	73.2k	MyST [44]	Child	GPT-3.5-Turbo
Summary	1,052.72	729.4k	-	-	-
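
As a quick consistency check, the Summary row of Table 5 can be recomputed from the per-type rows. The snippet below is only an illustrative sketch: the dictionary name `table5` and its layout are our own, and the numbers are copied from the table.

```python
# Per-type statistics copied from Table 5: (hours, utterances in thousands).
# The variable names here are illustrative and not part of SD-Eval.
table5 = {
    "Emotion": (120.60, 100.5),
    "Accent": (759.75, 508.6),
    "Environment": (32.06, 47.1),
    "Age": (140.31, 73.2),
}

# Sum each column and round to the precision used in the table.
total_hours = round(sum(hours for hours, _ in table5.values()), 2)
total_utts_k = round(sum(utts for _, utts in table5.values()), 1)

print(f"{total_hours:.2f} hours, {total_utts_k:.1f}k utterances")
```

The per-type rows sum to the totals reported in the Summary row.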

A.2 Zero-shot TTS Model

Our internal zero-shot TTS model is an auto-regressive model similar to BASE-TTS [25]. We evaluate it with objective metrics covering speaker similarity (SIM-O and SIM-R) and robustness (WER): 1) To evaluate speaker similarity, we use the WavLM-TDCNN [7] speaker embedding model, which measures how closely generated samples match the original prompt (SIM-O) and the reconstructed prompt (SIM-R). 2) To measure robustness, we calculate the Word Error Rate (WER) using a CTC-based HuBERT model (https://huggingface.co/facebook/hubert-large-ls960-ft) that was initially trained on LibriLight and subsequently fine-tuned on the 960-hour training set of LibriSpeech. We compare our model with SOTA auto-regressive TTS models: VALL-E [54], CLaM-TTS [24], VoiceCraft [42], XTTS-v2 (https://huggingface.co/coqui/XTTS-v2), and WhisperSpeech (https://github.com/collabora/WhisperSpeech). We also adapt classifier-free guidance (cfg) [18, 47] for better generation. We use LibriSpeech test-clean for evaluation, which contains 40 distinct speakers. Following [54, 22], we randomly select one sentence for each speaker as the target and a 3-second clip from the same speaker’s speech as the prompt.

Model	Training Data	SIM-O $\uparrow$	SIM-R $\uparrow$	WER $\downarrow$
Ground Truth	-	0.68	-	0.34
VALL-E	LibriLight	-	0.58	5.9
CLaM-TTS	MLS	0.49	0.54	5.11
VoiceCraft	GigaSpeech	0.45	-	6.68
XTTS-v2	-	0.51	-	5.5
WhisperSpeech	LibriLight	0.48	-	4.78
Ours	LibriLight	0.58	0.61	5.56
Ours (w. cfg)	LibriLight	0.60	0.63	4.32
Ours (w. cfg, rerank 5)	LibriLight	0.63	0.66	2.01
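
The two kinds of metrics used above can be sketched in a few lines of Python. This is a minimal illustration and not our evaluation code. It assumes that speaker similarity is the cosine similarity between two speaker embeddings (here plain lists of floats standing in for WavLM-TDCNN outputs) and that WER is the word-level edit distance divided by the number of reference words. The function names `cosine_similarity` and `word_error_rate` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Toy usage: one substitution and one deletion against six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
print(f"WER = {wer:.3f}, similarity = {sim:.1f}")
```

The function returns WER as a fraction, which can be multiplied by 100 to obtain a percentage. A corpus-level score would typically sum the edit counts and reference lengths over all utterances before dividing, rather than average per-utterance values.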