DOI: 10.1145/3323873.3325049
Research article

Understanding, Categorizing and Predicting Semantic Image-Text Relations

Published: 05 June 2019

Abstract

Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
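The abstract characterizes each image-text pair by three metrics: cross-modal mutual information, semantic correlation, and the status relation between the two modalities, and maps metric combinations to one of eight classes. A minimal sketch of such a metric-to-class decision table follows. The `MetricTriple` type, all thresholds, the decision order, and every label except "illustration" and "anchorage" (the two classes named in the abstract) are illustrative assumptions, not the paper's actual rules:

```python
from dataclasses import dataclass

# Hypothetical metric ranges (assumed for illustration, not from the paper):
#   cmi    -- cross-modal mutual information, scaled to [0, 1]
#   sc     -- semantic correlation, scaled to [-1, 1]
#   status -- subordinate modality: "equal", "image", or "text"
@dataclass
class MetricTriple:
    cmi: float
    sc: float
    status: str

def classify(m: MetricTriple) -> str:
    """Map a metric triple to an image-text relation label.

    Thresholds and decision order are placeholders; only "illustration"
    and "anchorage" are class names taken from the abstract.
    """
    if m.cmi < 0.2 and abs(m.sc) < 0.2:
        return "uncorrelated"      # modalities share essentially no content
    if m.sc < 0:
        return "contrasting"       # modalities are semantically opposed
    if m.status == "image":
        return "illustration"      # image subordinate: it exemplifies the text
    if m.status == "text":
        return "anchorage"         # text subordinate: it pins down the image
    return "complementary"         # equal status, overlapping content

# A caption-like pair where the image merely exemplifies the text:
print(classify(MetricTriple(cmi=0.7, sc=0.8, status="image")))  # illustration
```

The point of the sketch is that once the three metrics are quantized, the eight classes fall out of a small, interpretable decision table rather than an opaque end-to-end label; the paper's deep learning system predicts such classes from multimodal embeddings.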



Published In

cover image ACM Conferences
ICMR '19: Proceedings of the 2019 on International Conference on Multimedia Retrieval
June 2019
427 pages
ISBN:9781450367653
DOI:10.1145/3323873

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Paper

Author Tags

  1. data augmentation
  2. image-text relations
  3. multimodality
  4. semantic gap

Qualifiers

  • Research-article

Funding Sources

  • Leibniz-Gemeinschaft

Conference

ICMR '19

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Article Metrics

  • Downloads (last 12 months): 75
  • Downloads (last 6 weeks): 6
Reflects downloads up to 03 Oct 2024

Cited By

  • MUWS 2024: The 3rd International Workshop on Multimodal Human Understanding for the Web and Social Media. In Proceedings of the 2024 International Conference on Multimedia Retrieval, 1342-1344 (30 May 2024). DOI: 10.1145/3652583.3658893
  • Distinguishing Visually Similar Images: Triplet Contrastive Learning Framework for Image-text Retrieval. In 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6 (15 July 2024). DOI: 10.1109/ICME57554.2024.10687694
  • Image-text coherence and its implications for multimodal AI. Frontiers in Artificial Intelligence, Vol. 6 (15 May 2023). DOI: 10.3389/frai.2023.1048874
  • MUWS'2023: The 2nd International Workshop on Multimodal Human Understanding for the Web and Social Media. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 5263-5266 (21 October 2023). DOI: 10.1145/3583780.3615310
  • Self-Supervised Distilled Learning for Multi-modal Misinformation Identification. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2818-2827 (January 2023). DOI: 10.1109/WACV56688.2023.00284
  • Towards an Exhaustive Evaluation of Vision-Language Foundation Models. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 339-352 (2 October 2023). DOI: 10.1109/ICCVW60793.2023.00041
  • Glove-Ing Attention: A Multi-Modal Neural Learning Approach to Image Captioning. In 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 1-5 (4 June 2023). DOI: 10.1109/ICASSPW59220.2023.10193011
  • Examining the use of text and video resources during web-search based learning - a new methodological approach. New Review of Hypermedia and Multimedia, Vol. 28, No. 1-2, 39-67 (14 July 2022). DOI: 10.1080/13614568.2022.2099583
  • MoNA: A Forensic Analysis Platform for Mobile Communication. KI - Künstliche Intelligenz, Vol. 36, No. 2, 163-169 (24 May 2022). DOI: 10.1007/s13218-022-00762-w
  • Hybridization of Intelligent Solutions Architecture for Text Understanding and Text Generation. Applied Sciences, Vol. 11, No. 11, 5179 (2 June 2021). DOI: 10.3390/app11115179
