Research Article | Open Access

Graph-based Multimodal Ranking Models for Multimodal Summarization

Published: 26 May 2021

Abstract

Multimodal summarization aims to extract the most important information from multimedia input. It has become increasingly popular due to the rapid growth of multimedia data in recent years. A variety of studies have addressed different multimodal summarization tasks. However, each existing method can generate either single-modal output or multimodal output, but not both. In addition, most of them require large amounts of annotated training data, which makes them difficult to generalize to other tasks or domains. Motivated by this, we propose a unified framework for multimodal summarization that covers both single-modal output summarization and multimodal output summarization. Within this framework, we consider three different scenarios and propose corresponding unsupervised graph-based multimodal summarization models that require no manually annotated document-summary pairs for training: (1) generic multimodal ranking, (2) modal-dominated multimodal ranking, and (3) non-redundant text-image multimodal ranking. Furthermore, an image-text similarity estimation model is introduced to measure the semantic similarity between images and text. Experiments show that our proposed models outperform single-modal summarization methods on both automatic and human evaluation metrics. Moreover, our models can improve single-modal summarization under the guidance of multimedia information. This study can serve as a benchmark for further research on the multimodal summarization task.
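
To make the ranking idea concrete, the following is a minimal sketch of what the generic multimodal ranking scenario could look like: a LexRank-style damped random walk over a joint graph whose nodes are sentences and images. This is an illustration under stated assumptions, not the paper's exact formulation; the function name multimodal_rank, the uniform fusion of the three similarity matrices, and the parameter defaults are all hypothetical, and the sentence-image similarities are assumed to come from a learned image-text similarity estimation model such as the one the abstract describes.

```python
import numpy as np

def multimodal_rank(sent_sim, img_sim, cross_sim,
                    damping=0.85, max_iter=100, tol=1e-6):
    """Jointly score sentences and images on one affinity graph.

    sent_sim:  (n_s, n_s) sentence-sentence similarities
    img_sim:   (n_i, n_i) image-image similarities
    cross_sim: (n_s, n_i) sentence-image similarities, e.g. produced
               by a learned image-text similarity estimation model
    Returns (sentence_scores, image_scores).
    """
    n_s, n_i = cross_sim.shape
    # Assemble the joint affinity matrix over all text and image nodes.
    affinity = np.block([[sent_sim, cross_sim],
                         [cross_sim.T, img_sim]])
    np.fill_diagonal(affinity, 0.0)           # no self-loops
    # Row-normalize to obtain a stochastic transition matrix.
    row_sums = affinity.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0           # guard isolated nodes
    transition = affinity / row_sums
    # Damped power iteration, as in PageRank/LexRank.
    n = n_s + n_i
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1.0 - damping) / n + damping * (transition.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            scores = new_scores
            break
        scores = new_scores
    return scores[:n_s], scores[n_s:]
```

Under these assumptions, a summary would be assembled by selecting the top-scoring sentences (and, for multimodal output, the top-scoring images); the modal-dominated and non-redundant variants would presumably bias the random walk toward one modality or penalize overlap with already-selected items.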

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 4 (July 2021), 419 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3465463

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 26 May 2021
Accepted: 01 December 2020
Revised: 01 October 2020
Received: 01 August 2019
Published in TALLIP Volume 20, Issue 4

Author Tags

1. Multimodal summarization
2. Single-modal
3. Multimodal ranking
4. Unsupervised
