DOI: 10.1145/3581783.3613823

Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation

Published: 27 October 2023

Abstract

Conversational text-to-speech synthesis (TTS) aims to generate speech with an appropriate style for the user-agent conversation scenario. Although previous works have explored modeling the dialogue-history context to provide style information for the agent, they still fall short of modeling the role-aware multi-modal context. Moreover, previous works ignore the emotional dependencies between the user and the agent: 1) the agent should understand the user's emotional state, and 2) the agent should express appropriate emotion in its generated speech. In this work, we propose an Emotionally Situated Text-to-speech Synthesis (EmoSit-TTS) framework that understands the user's semantics and subtle emotional state, and generates speech with an appropriate speaking style and emotional expression in user-agent conversation. Experiments on the DailyTalk dataset show the superiority of the proposed framework for user-agent conversational TTS, especially in emotion-aware expressiveness, where it outperforms other state-of-the-art methods by 0.69 MOS. Demos of the proposed framework are available at https://anonydemo.github.io.
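
To make the high-level pipeline concrete, below is a minimal sketch of how an emotionally situated conversational TTS system along these lines could be wired together. Everything here is an illustrative assumption: the module names, feature dimensions, GRU-based context encoder, and decoder stand-in are not from the paper (the abstract does not specify the architecture). The sketch only captures the two dependencies the abstract names, namely predicting the user's emotional state from a role-aware multi-modal dialogue history, and conditioning the agent's speech generation on that prediction.

# Hypothetical sketch, not the published EmoSit-TTS architecture.
# All module names, dimensions, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class RoleAwareContextEncoder(nn.Module):
    """Encodes the multi-modal dialogue history, distinguishing user/agent turns."""

    def __init__(self, text_dim=768, audio_dim=256, hidden_dim=256, num_roles=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.role_emb = nn.Embedding(num_roles, hidden_dim)  # 0 = user, 1 = agent
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, text_feats, audio_feats, roles):
        # text_feats: (B, T, text_dim); audio_feats: (B, T, audio_dim); roles: (B, T)
        turns = self.text_proj(text_feats) + self.audio_proj(audio_feats)
        turns = turns + self.role_emb(roles)  # inject speaker-role information
        _, context = self.gru(turns)          # final hidden state: (1, B, hidden_dim)
        return context.squeeze(0)             # (B, hidden_dim)


class EmotionAwareTTS(nn.Module):
    """Predicts the user's emotion from context and conditions synthesis on it."""

    def __init__(self, hidden_dim=256, num_emotions=7, n_mels=80):
        super().__init__()
        self.context_encoder = RoleAwareContextEncoder(hidden_dim=hidden_dim)
        self.emotion_head = nn.Linear(hidden_dim, num_emotions)  # understand user emotion
        self.style_proj = nn.Linear(hidden_dim + num_emotions, hidden_dim)
        # Stand-in for a full acoustic model conditioned on the style vector.
        self.decoder = nn.GRU(hidden_dim, n_mels, batch_first=True)

    def forward(self, phoneme_emb, text_feats, audio_feats, roles):
        context = self.context_encoder(text_feats, audio_feats, roles)
        emotion_logits = self.emotion_head(context)                        # (B, num_emotions)
        style = self.style_proj(
            torch.cat([context, emotion_logits.softmax(-1)], dim=-1))     # (B, hidden_dim)
        # Broadcast the style vector over the phoneme sequence before decoding.
        mel, _ = self.decoder(phoneme_emb + style.unsqueeze(1))
        return mel, emotion_logits


# Smoke test with random tensors: 2 dialogues, 4 history turns, 50 phonemes.
model = EmotionAwareTTS()
mel, emo = model(
    torch.randn(2, 50, 256),        # phoneme embeddings for the agent's reply
    torch.randn(2, 4, 768),         # per-turn text features (e.g., BERT sentence vectors)
    torch.randn(2, 4, 256),         # per-turn audio features
    torch.randint(0, 2, (2, 4)),    # speaker role per turn
)
print(mel.shape, emo.shape)         # torch.Size([2, 50, 80]) torch.Size([2, 7])

In a real system the decoder stand-in would be a full acoustic model such as FastSpeech 2 followed by a vocoder such as HiFi-GAN, and the per-turn text and audio features would come from pretrained encoders rather than random tensors.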


Cited By

  • (2024) Generative Expressive Conversational Speech Synthesis. In Proceedings of the 32nd ACM International Conference on Multimedia, 4187-4196. DOI: 10.1145/3664647.3681697. Online publication date: 28-Oct-2024.
  • (2024) Inferring Agent Speaking Styles for Auditory-Visual User-Agent Conversation. In 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 421-425. DOI: 10.1109/ISCSLP63861.2024.10800066. Online publication date: 7-Nov-2024.

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. emotional text-to-speech in dialogue
    2. multi-modal dialogue understanding
    3. user-agent conversational text-to-speech

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall acceptance rate: 2,145 of 8,556 submissions (25%)

    Article Metrics

    • Downloads (last 12 months): 165
    • Downloads (last 6 weeks): 16
    Reflects downloads up to 31 Jan 2025.
