Diverse Image Captioning via Conditional Variational Autoencoder and Dual Contrastive Learning

Published: 18 September 2023

Abstract

Diverse image captioning has achieved substantial progress in recent years. However, the discriminability of generative models and the limitations of the cross-entropy loss are generally overlooked in traditional diverse image captioning models, which seriously hurts both the diversity and the accuracy of image captioning. In this article, aiming to improve diversity and accuracy simultaneously, we propose a novel Conditional Variational Autoencoder with Dual Contrastive Learning (DCL-CVAE) framework for diverse image captioning that seamlessly integrates a sequential variational autoencoder with contrastive learning. In the encoding stage, we first build conditional variational autoencoders to separately learn the sequential latent spaces for a pair of captions. Then, we introduce contrastive learning in the sequential latent spaces to enhance the discriminability of the latent representations of both matched image-caption pairs and mismatched pairs. In the decoding stage, we use captions sampled from a pre-trained Long Short-Term Memory (LSTM) decoder as negative examples and perform contrastive learning against greedily sampled positive examples, which restrains the generation of the common words and phrases induced by the cross-entropy loss. By virtue of this dual contrastive learning, DCL-CVAE encourages discriminability and facilitates diversity while promoting the accuracy of the generated captions. Extensive experiments on the challenging MSCOCO dataset show that our proposed method achieves a better balance between accuracy and diversity than state-of-the-art diverse image captioning models.
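
The two contrastive objectives described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is a minimal illustration, not the authors' code: an InfoNCE-style loss that pulls matched image-caption latent codes together while treating mismatched in-batch pairs as negatives, and a hinge-style term that scores greedily decoded captions (positives) above captions sampled from a pre-trained LSTM decoder (negatives). All names, dimensions, and the temperature/margin values are illustrative assumptions; in the full model these terms would be added to the usual CVAE objective (reconstruction cross-entropy plus KL divergence).

import torch
import torch.nn.functional as F

def latent_contrastive_loss(image_z, caption_z, temperature=0.07):
    # InfoNCE-style loss over a batch of matched (image, caption) latent codes.
    # Row i of image_z and caption_z comes from the same image-caption pair;
    # all other rows in the batch act as mismatched (negative) pairs.
    image_z = F.normalize(image_z, dim=-1)
    caption_z = F.normalize(caption_z, dim=-1)
    logits = image_z @ caption_z.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross entropy: image-to-caption and caption-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def decoder_contrastive_loss(pos_logprob, neg_logprob, margin=1.0):
    # Hinge-style term that pushes the log-probability of greedily decoded
    # captions (positives) above that of captions sampled from a pre-trained
    # LSTM decoder (negatives), discouraging generic words and phrases.
    return F.relu(margin - (pos_logprob - neg_logprob)).mean()

if __name__ == "__main__":
    z_img = torch.randn(8, 128)   # hypothetical latent codes for 8 images
    z_cap = torch.randn(8, 128)   # latent codes for the paired captions
    pos = torch.randn(8)          # per-caption log-probabilities (positives)
    neg = torch.randn(8)          # per-caption log-probabilities (negatives)
    loss = latent_contrastive_loss(z_img, z_cap) + decoder_contrastive_loss(pos, neg)
    print(loss.item())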


      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 1
      January 2024, 639 pages
      EISSN: 1551-6865
      DOI: 10.1145/3613542
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 September 2023
      Online AM: 11 August 2023
      Accepted: 28 July 2023
      Revised: 01 July 2023
      Received: 16 December 2022
      Published in TOMM Volume 20, Issue 1

      Author Tags

      1. Diverse image captioning
      2. variational autoencoder
      3. contrastive learning
      4. sequential latent space

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
