research-article

Multi-modal multi-hop interaction network for dialogue response generation

Published: 01 October 2023

Abstract

Most task-oriented dialogue systems generate informative and appropriate responses by leveraging structured knowledge bases, which in practice are not always available. In the e-commerce scenario, for instance, commercial items often lack key attribute values while carrying abundant unstructured multi-modal information, e.g., text descriptions and images. Previous studies have not fully explored such information for dialogue response generation. In this paper, we propose a Multi-modal multi-hop Interaction Network for Dialogue (MIND) to facilitate 1) the interaction between a query and multi-modal information through a query-aware multi-modal encoder and 2) the interaction between modalities through a multi-hop decoder. Extensive experiments demonstrate the effectiveness of MIND over strong baselines: it achieves state-of-the-art performance in both automatic and human evaluation. We also release two real-world large-scale datasets containing both dialogue history and items’ multi-modal information to facilitate future research.

Highlights

  • Learn structured knowledge directly from items’ multi-modal illustration.
  • Propose a multi-modal interaction dialogue generation model.
  • Release two real-world large-scale datasets to facilitate future research.

Cited By

  • (2024) Recognition of propaganda techniques in newspaper texts. Expert Systems with Applications: An International Journal, 251:C. DOI: 10.1016/j.eswa.2024.124085. Online publication date: 24-Jul-2024
  • (2024) A variational selection mechanism for article comment generation. Expert Systems with Applications: An International Journal, 237:PC. DOI: 10.1016/j.eswa.2023.121263. Online publication date: 1-Mar-2024

Published In

Expert Systems with Applications: An International Journal  Volume 227, Issue C
Oct 2023
1538 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. Dialogue response generation
  2. Multimodal
  3. Interaction

Qualifiers

  • Research-article
