DOI: 10.1145/3581783.3612263

Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences

Published: 27 October 2023

Abstract

Stylized visual captioning aims to generate image or video descriptions with specific styles, making them more attractive and emotionally appropriate. A major challenge in this task is the lack of paired stylized captions for visual content, so most existing works focus on unsupervised methods that do not rely on parallel datasets. However, these approaches still require training with sufficient examples that carry style labels, and the generated captions are limited to predefined styles. To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference and without requiring further training. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module. Our two-step training scheme proceeds as follows: first, we train a style extractor to produce style representations on an unlabeled text-only corpus; then, we freeze the extractor and train our decoder to generate stylized descriptions conditioned on the extracted style vector and the projected visual content vectors. During inference, our model generates captions in the desired style by deriving the style representation from user-supplied examples. On few-shot sentimental visual captioning, automatic evaluations show that our method outperforms state-of-the-art approaches and is comparable to models fully trained on labeled style corpora. Human evaluations further confirm our model's ability to handle multiple styles.
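
The abstract describes a three-part inference flow: a frozen style extractor derives a style vector from a few user-supplied stylized sentences, a visual projection module maps image or video features into content vectors, and a conditional decoder generates a caption from both. The sketch below illustrates that flow in PyTorch; all class names, dimensions, and the GRU-based internals are hypothetical placeholders for illustration, not the authors' actual FS-StyleCap implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the few-shot stylized captioning inference flow described
# in the abstract. Module names, dimensions, and architectures are assumptions.

class StyleExtractor(nn.Module):
    """Maps a few stylized example sentences to one style vector.
    (Per the abstract, this is trained on an unlabeled text-only corpus, then frozen.)"""
    def __init__(self, vocab_size=30522, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, example_ids):                  # (num_examples, seq_len)
        h, _ = self.encoder(self.embed(example_ids))
        sent_vecs = h.mean(dim=1)                    # pool each sentence
        return sent_vecs.mean(dim=0, keepdim=True)   # average into a single style vector

class VisualProjector(nn.Module):
    """Projects a visual feature into a short sequence of content vectors for the decoder."""
    def __init__(self, d_visual=512, d_model=512, n_tokens=4):
        super().__init__()
        self.proj = nn.Linear(d_visual, d_model * n_tokens)
        self.n_tokens, self.d_model = n_tokens, d_model

    def forward(self, visual_feat):                  # (1, d_visual)
        return self.proj(visual_feat).view(1, self.n_tokens, self.d_model)

class ConditionalDecoder(nn.Module):
    """Autoregressive decoder conditioned on [style vector; visual content vectors]."""
    def __init__(self, vocab_size=30522, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, style_vec, content_vecs, bos_id=101, eos_id=102, max_len=20):
        prefix = torch.cat([style_vec.unsqueeze(1), content_vecs], dim=1)
        _, state = self.rnn(prefix)                  # encode the conditioning prefix
        token, out = torch.tensor([[bos_id]]), []
        for _ in range(max_len):
            step, state = self.rnn(self.embed(token), state)
            token = self.lm_head(step[:, -1]).argmax(dim=-1, keepdim=True)
            if token.item() == eos_id:
                break
            out.append(token.item())
        return out

# Few-shot inference: the style signal comes only from the user's example sentences.
style_vec = StyleExtractor()(torch.randint(0, 30522, (3, 12)))  # 3 stylized examples
content = VisualProjector()(torch.randn(1, 512))                # projected visual feature
caption_ids = ConditionalDecoder().generate(style_vec, content)
print(caption_ids)  # token ids of the (randomly initialized) stylized caption
```

Run as-is, the script emits token ids from random weights; the point is only to show how the few-shot style vector and the projected visual content jointly condition generation at inference time, with no additional training for a new style.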




      Information

      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 27 October 2023


      Author Tags

      1. few-shot learning
      2. stylized visual captioning

      Qualifiers

      • Research-article

      Conference

      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 89
      • Downloads (Last 6 weeks): 5
      Reflects downloads up to 22 Jan 2025


      Cited By
      • (2025) Interpreting Personality Traits in Social Media Images Through Visual Question Answering. Advances in Data and Information Sciences, 65-75. https://doi.org/10.1007/978-981-97-7360-2_7. Online publication date: 3-Jan-2025.
      • (2024) Edit As You Wish: Video Caption Editing with Multi-grained User Control. Proceedings of the 32nd ACM International Conference on Multimedia, 1924-1933. https://doi.org/10.1145/3664647.3680724. Online publication date: 28-Oct-2024.
      • (2024) Multi-Stage Refined Visual Captioning for Baidu Ad Creatives Generation. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 4198-4202. https://doi.org/10.1145/3627673.3679969. Online publication date: 21-Oct-2024.
      • (2024) Control With Style: Style Embedding-Based Variational Autoencoder for Controlled Stylized Caption Generation Framework. IEEE Transactions on Cognitive and Developmental Systems, 16(6), 2032-2042. https://doi.org/10.1109/TCDS.2024.3405573. Online publication date: Dec-2024.
      • (2024) Fine-Grained Length Controllable Video Captioning With Ordinal Embeddings. IEEE Access, 12, 189667-189688. https://doi.org/10.1109/ACCESS.2024.3506751. Online publication date: 2024.
