research-article

Multi-modal multi-hop interaction network for dialogue response generation

Authors:

Xuanjing Huang

Volume 227, Issue C

https://doi.org/10.1016/j.eswa.2023.120267

Published: 01 October 2023

Abstract

Most task-oriented dialogue systems generate informative and appropriate responses by leveraging structured knowledge bases, which in practice are not always available. For instance, in the e-commerce scenario, commercial items often lack key attribute values while carrying abundant unstructured multi-modal information, e.g., text descriptions and images. Previous studies have not fully explored such information for dialogue response generation. In this paper, we propose a Multi-modal multi-hop Interaction Network for Dialogue (MIND) to facilitate 1) the interaction between a query and multi-modal information through a query-aware multi-modal encoder and 2) the interaction between modalities through a multi-hop decoder. Extensive experiments demonstrate the effectiveness of MIND over strong baselines: it achieves state-of-the-art performance in both automatic and human evaluation. We also release two real-world large-scale datasets containing both dialogue history and items’ multi-modal information to facilitate future research.

Highlights

• Learn structured knowledge directly from items’ multi-modal illustration.

• Propose a multi-modal interaction dialogue generation model.

• Release two real large-scale datasets to facilitate future research.

Published In

Expert Systems with Applications: An International Journal Volume 227, Issue C

Oct 2023

1538 pages

ISSN:0957-4174

Elsevier Ltd.

Publisher

Pergamon Press, Inc.

United States
