DOI: 10.1145/3404555.3404610 · ICCAI '20 conference proceedings · Research article

Generalization or Instantiation?: Estimating the Relative Abstractness between Images and Text

Published: 20 August 2020

Abstract

Learning from multi-modal data is very common in current data mining and knowledge management applications. However, the information imbalance between modalities poses challenges for many multi-modal learning tasks, such as cross-modal retrieval, image captioning, and image synthesis. Understanding the cross-modal information gap is an important foundation for designing models and choosing evaluation criteria for those applications. For text and image data in particular, existing studies have proposed abstractness as a measure of this information imbalance. They evaluate the abstractness disparity by training a classifier on manually annotated multi-modal sample pairs. However, these methods ignore the impact of the intra-modal relationship on the inter-modal abstractness; moreover, the annotation process is labor-intensive, and its quality cannot be guaranteed. To evaluate the text-image relationship more comprehensively and reduce the cost of evaluation, we propose the relative abstractness index (RAI), which measures the abstractness of a sample according to its certainty in differentiating items of the other modality. In addition, we propose a cycled generation model to compute RAI values between images and text. In contrast to existing works, the proposed index better describes the image-text information disparity, and its computation requires no annotated training samples.
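The abstract does not give the RAI formula, but the core idea ("the certainty of differentiating the items of another modality") can be illustrated with a hypothetical certainty score: given a sample's similarity scores to a set of candidates from the other modality, a peaked softmax distribution means the sample pins down its match (low abstractness), while a flat one means it is ambiguous (high abstractness). The function below is a minimal sketch under that assumption, not the authors' actual computation, which uses a cycled generation model.

```python
import math

def differentiation_certainty(similarities):
    """Hypothetical certainty score for one item, given its similarity
    scores to candidate items of the other modality.

    Certainty is 1 minus the normalized entropy of the softmax over
    the scores: a peaked distribution (one clear match) yields a value
    near 1, a flat distribution (no preference) yields 0.
    """
    m = max(similarities)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in similarities]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(similarities))  # entropy of a uniform distribution
    return 1.0 - entropy / max_entropy

# A caption clearly matching one image is a confident, less abstract
# sample; a caption equally similar to every image is more abstract.
confident = differentiation_certainty([9.0, 1.0, 1.0, 1.0])
uncertain = differentiation_certainty([2.0, 2.0, 2.0, 2.0])
```

Under this reading, a uniform score vector gives certainty exactly 0 (entropy equals its maximum), and increasingly peaked vectors approach 1.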



Published In

ICCAI '20: Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence
April 2020
563 pages
ISBN:9781450377089
DOI:10.1145/3404555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Multi-modality
  2. generative adversarial networks
  3. image-text relationship
  4. relative abstractness

Qualifiers

  • Research-article
  • Research
  • Refereed limited

