Research Article
DOI: 10.1145/3543507.3583232

CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

Published: 30 April 2023

Abstract

Automatically generating textual descriptions for the massive number of unlabeled images on the web can greatly benefit real-world web applications, e.g., multimodal retrieval and recommendation. However, existing models suffer from generating “over-generic” descriptions: they tend to produce repetitive sentences with common concepts for different images. Such generic descriptions fail to provide sufficient textual semantics for ever-changing web images. Inspired by the recent success of Vision-Language Pre-training (VLP) models, which learn diverse image-text concept alignments during pre-training, we explore leveraging their cross-modal pre-trained knowledge to automatically enrich the textual semantics of image descriptions. Without any additional human annotations, we propose a plug-and-play framework, CapEnrich, that complements generic image descriptions with more semantic details. Specifically, we first propose an automatic data-building strategy to obtain the desired training sentences, on top of which we adopt prompting strategies, i.e., learnable and template prompts, to incentivize VLP models to generate more textual details. For the learnable prompts, we freeze the whole VLP model and tune only the prompt vectors, which brings two advantages: 1) the pre-training knowledge of the VLP model is preserved as much as possible, so it can describe diverse visual concepts; 2) only lightweight trainable parameters are required, making the approach friendly to low-resource settings. Extensive experiments show that our method significantly improves the descriptiveness and diversity of the sentences generated for web images. The code is available at https://github.com/yaolinli/CapEnrich.
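
The mechanism the abstract describes for the learnable prompts (a frozen VLP backbone with a handful of trainable prompt vectors prepended to the text input) is sketched below in PyTorch. This is a minimal illustration under assumptions, not the authors' implementation: the `vlp_model` forward signature, the feature shapes, and the prompt length are hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the CapEnrich code): freeze a pre-trained
# vision-language captioning model and train only a few continuous prompt
# vectors that are prepended to the text embeddings.
import torch
import torch.nn as nn


class PromptedCaptioner(nn.Module):
    def __init__(self, vlp_model: nn.Module, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        self.vlp = vlp_model
        # Keep all pre-trained cross-modal knowledge fixed.
        for p in self.vlp.parameters():
            p.requires_grad = False
        # The only trainable parameters: learnable prompt vectors.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, image_feats: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats:  (B, N_img, dim) visual features from the frozen encoder
        # token_embeds: (B, N_txt, dim) embeddings of the generic caption so far
        batch = token_embeds.size(0)
        prompt = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # [learnable prompts ; caption tokens] conditions the frozen model to
        # continue the generic caption with additional visual details.
        text_input = torch.cat([prompt, token_embeds], dim=1)
        return self.vlp(image_feats, text_input)  # hypothetical VLP forward pass


# Only the prompt vectors reach the optimizer, so training stays lightweight:
# model = PromptedCaptioner(pretrained_vlp)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

Because only the prompt vectors receive gradients, the trainable parameter count stays tiny relative to the frozen backbone, which is the lightweight-training advantage the abstract claims.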

Supplemental Material

Appendix (PDF file)


Cited By

  • (2024) Enhancing multimodal knowledge graph representation learning through triple contrastive learning. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 5963–5971. https://doi.org/10.24963/ijcai.2024/659. Online publication date: 3 August 2024.



          Published In

          WWW '23: Proceedings of the ACM Web Conference 2023
          April 2023
          4293 pages
          ISBN:9781450394161
          DOI:10.1145/3543507


          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 30 April 2023


          Author Tags

          1. image description
          2. prompt tuning
          3. textual semantics
          4. vision-language pretraining model

          Qualifiers

          • Research-article
          • Research
          • Refereed limited


          Conference

          WWW '23
          Sponsor:
          WWW '23: The ACM Web Conference 2023
          April 30 - May 4, 2023
Austin, TX, USA

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

