DOI: 10.1145/3581783.3613823

Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation

Published: 27 October 2023

Abstract

Conversational text-to-speech synthesis (TTS) aims to generate speech with an appropriate style for the user-agent conversation scenario. Although previous works have explored modeling the dialogue-history context to provide style information for the agent, they still fall short of modeling the role-aware multi-modal context. Moreover, previous works ignore the emotional dependencies between the user and the agent: 1) the agent should understand the user's emotional state, and 2) the agent should express appropriate emotion in its generated speech. In this work, we propose an Emotionally Situated Text-to-speech Synthesis (EmoSit-TTS) framework that understands the user's semantics and subtle emotional state, and generates speech with an appropriate speaking style and emotional expression in user-agent conversation. Experiments on the DailyTalk dataset show the superiority of the proposed framework for user-agent conversational TTS, especially in emotion-aware expressiveness, where it outperforms other state-of-the-art methods by 0.69 MOS. Demos of the proposed framework are available at https://anonydemo.github.io.
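
To make the high-level pipeline concrete, below is a minimal sketch of how an emotionally situated conversational TTS system along these lines could be wired together. Everything here is an illustrative assumption: the module names, feature dimensions, GRU-based context encoder, and decoder stand-in are not from the paper (the abstract does not specify the architecture). The sketch only captures the two dependencies the abstract names, namely predicting the user's emotional state from a role-aware multi-modal dialogue history, and conditioning the agent's speech generation on that prediction.

# Hypothetical sketch, not the published EmoSit-TTS architecture.
# All module names, dimensions, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class RoleAwareContextEncoder(nn.Module):
    """Encodes the multi-modal dialogue history, distinguishing user/agent turns."""

    def __init__(self, text_dim=768, audio_dim=256, hidden_dim=256, num_roles=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.role_emb = nn.Embedding(num_roles, hidden_dim)  # 0 = user, 1 = agent
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, text_feats, audio_feats, roles):
        # text_feats: (B, T, text_dim); audio_feats: (B, T, audio_dim); roles: (B, T)
        turns = self.text_proj(text_feats) + self.audio_proj(audio_feats)
        turns = turns + self.role_emb(roles)  # inject speaker-role information
        _, context = self.gru(turns)          # final hidden state: (1, B, hidden_dim)
        return context.squeeze(0)             # (B, hidden_dim)


class EmotionAwareTTS(nn.Module):
    """Predicts the user's emotion from context and conditions synthesis on it."""

    def __init__(self, hidden_dim=256, num_emotions=7, n_mels=80):
        super().__init__()
        self.context_encoder = RoleAwareContextEncoder(hidden_dim=hidden_dim)
        self.emotion_head = nn.Linear(hidden_dim, num_emotions)  # understand user emotion
        self.style_proj = nn.Linear(hidden_dim + num_emotions, hidden_dim)
        # Stand-in for a full acoustic model conditioned on the style vector.
        self.decoder = nn.GRU(hidden_dim, n_mels, batch_first=True)

    def forward(self, phoneme_emb, text_feats, audio_feats, roles):
        context = self.context_encoder(text_feats, audio_feats, roles)
        emotion_logits = self.emotion_head(context)                        # (B, num_emotions)
        style = self.style_proj(
            torch.cat([context, emotion_logits.softmax(-1)], dim=-1))     # (B, hidden_dim)
        # Broadcast the style vector over the phoneme sequence before decoding.
        mel, _ = self.decoder(phoneme_emb + style.unsqueeze(1))
        return mel, emotion_logits


# Smoke test with random tensors: 2 dialogues, 4 history turns, 50 phonemes.
model = EmotionAwareTTS()
mel, emo = model(
    torch.randn(2, 50, 256),        # phoneme embeddings for the agent's reply
    torch.randn(2, 4, 768),         # per-turn text features (e.g., BERT sentence vectors)
    torch.randn(2, 4, 256),         # per-turn audio features
    torch.randint(0, 2, (2, 4)),    # speaker role per turn
)
print(mel.shape, emo.shape)         # torch.Size([2, 50, 80]) torch.Size([2, 7])

In a real system the decoder stand-in would be a full acoustic model such as FastSpeech 2 followed by a vocoder such as HiFi-GAN, and the per-turn text and audio features would come from pretrained encoders rather than random tensors.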


Cited By

  • (2024) Generative Expressive Conversational Speech Synthesis. In Proceedings of the 32nd ACM International Conference on Multimedia, 4187-4196. DOI: 10.1145/3664647.3681697. Online publication date: 28-Oct-2024.
  • (2024) Inferring Agent Speaking Styles for Auditory-Visual User-Agent Conversation. In 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 421-425. DOI: 10.1109/ISCSLP63861.2024.10800066. Online publication date: 7-Nov-2024.

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. emotional text-to-speech in dialogue
    2. multi-modal dialogue understanding
    3. user-agent conversational text-to-speech

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall acceptance rate: 2,145 of 8,556 submissions (25%)

    Article Metrics

    • Downloads (last 12 months): 165
    • Downloads (last 6 weeks): 16
    Reflects downloads up to 31 Jan 2025.
