DOI: 10.1145/3323873.3325049
Research article

Understanding, Categorizing and Predicting Semantic Image-Text Relations

Published: 05 June 2019

Abstract

Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
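The abstract characterizes each image-text pair by three metrics: cross-modal mutual information, semantic correlation, and the status relation between the two modalities, and maps metric combinations to one of eight classes. A minimal sketch of such a metric-to-class decision table follows. The `MetricTriple` type, all thresholds, the decision order, and every label except "illustration" and "anchorage" (the two classes named in the abstract) are illustrative assumptions, not the paper's actual rules:

```python
from dataclasses import dataclass

# Hypothetical metric ranges (assumed for illustration, not from the paper):
#   cmi    -- cross-modal mutual information, scaled to [0, 1]
#   sc     -- semantic correlation, scaled to [-1, 1]
#   status -- subordinate modality: "equal", "image", or "text"
@dataclass
class MetricTriple:
    cmi: float
    sc: float
    status: str

def classify(m: MetricTriple) -> str:
    """Map a metric triple to an image-text relation label.

    Thresholds and decision order are placeholders; only "illustration"
    and "anchorage" are class names taken from the abstract.
    """
    if m.cmi < 0.2 and abs(m.sc) < 0.2:
        return "uncorrelated"      # modalities share essentially no content
    if m.sc < 0:
        return "contrasting"       # modalities are semantically opposed
    if m.status == "image":
        return "illustration"      # image subordinate: it exemplifies the text
    if m.status == "text":
        return "anchorage"         # text subordinate: it pins down the image
    return "complementary"         # equal status, overlapping content

# A caption-like pair where the image merely exemplifies the text:
print(classify(MetricTriple(cmi=0.7, sc=0.8, status="image")))  # illustration
```

The point of the sketch is that once the three metrics are quantized, the eight classes fall out of a small, interpretable decision table rather than an opaque end-to-end label; the paper's deep learning system predicts such classes from multimodal embeddings.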



Published In

cover image ACM Conferences
ICMR '19: Proceedings of the 2019 on International Conference on Multimedia Retrieval
June 2019
427 pages
ISBN:9781450367653
DOI:10.1145/3323873

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Paper

Author Tags

  1. data augmentation
  2. image-text relations
  3. multimodality
  4. semantic gap

Qualifiers

  • Research-article

Funding Sources

  • Leibniz-Gemeinschaft

Conference

ICMR '19

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%

Article Metrics

  • Downloads (last 12 months): 75
  • Downloads (last 6 weeks): 6
Reflects downloads up to 03 Oct 2024

Cited By

  • MUWS 2024: The 3rd International Workshop on Multimodal Human Understanding for the Web and Social Media. In Proceedings of the 2024 International Conference on Multimedia Retrieval, 1342-1344 (30 May 2024). DOI: 10.1145/3652583.3658893
  • Distinguishing Visually Similar Images: Triplet Contrastive Learning Framework for Image-text Retrieval. In 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6 (15 July 2024). DOI: 10.1109/ICME57554.2024.10687694
  • Image-text coherence and its implications for multimodal AI. Frontiers in Artificial Intelligence, Vol. 6 (15 May 2023). DOI: 10.3389/frai.2023.1048874
  • MUWS'2023: The 2nd International Workshop on Multimodal Human Understanding for the Web and Social Media. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 5263-5266 (21 October 2023). DOI: 10.1145/3583780.3615310
  • Self-Supervised Distilled Learning for Multi-modal Misinformation Identification. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2818-2827 (January 2023). DOI: 10.1109/WACV56688.2023.00284
  • Towards an Exhaustive Evaluation of Vision-Language Foundation Models. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 339-352 (2 October 2023). DOI: 10.1109/ICCVW60793.2023.00041
  • Glove-Ing Attention: A Multi-Modal Neural Learning Approach to Image Captioning. In 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 1-5 (4 June 2023). DOI: 10.1109/ICASSPW59220.2023.10193011
  • Examining the use of text and video resources during web-search based learning - a new methodological approach. New Review of Hypermedia and Multimedia, Vol. 28, No. 1-2, 39-67 (14 July 2022). DOI: 10.1080/13614568.2022.2099583
  • MoNA: A Forensic Analysis Platform for Mobile Communication. KI - Künstliche Intelligenz, Vol. 36, No. 2, 163-169 (24 May 2022). DOI: 10.1007/s13218-022-00762-w
  • Hybridization of Intelligent Solutions Architecture for Text Understanding and Text Generation. Applied Sciences, Vol. 11, No. 11, 5179 (2 June 2021). DOI: 10.3390/app11115179
