
Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards

Published: 25 November 2024

Abstract

Distinctive Image Captioning (DIC)—generating distinctive captions that describe the unique details of a target image—has received considerable attention over the last few years. A recent DIC approach generates distinctive captions by comparing the target image with a set of semantically similar reference images, i.e., Reference-Based DIC (Ref-DIC). It aims to force the generated captions to distinguish the target image from the reference images. Unfortunately, the reference images used in existing Ref-DIC work are easy to distinguish: they resemble the target image only at the scene level and share few objects with it, so a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. For example, if the target image contains the objects “towel” and “toilet” while none of the reference images do, then a simple caption such as “A bathroom with a towel and a toilet” is distinctive enough to tell the target image apart from the references. To ensure that Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism that strictly controls the similarity between the target and reference images at the object/attribute level (vs. the scene level). Second, to generate distinctive captions, we develop a Transformer-based Ref-DIC baseline, TransDIC. It not only extracts visual features from the target image but also encodes the differences between objects in the target and reference images. Going one step further, we propose a stronger model, TransDIC++, which adds an extra contrastive learning module to make full use of the reference images. This module is model-agnostic and can be easily incorporated into various Ref-DIC architectures. Finally, for more trustworthy benchmarking, we propose a new evaluation metric for Ref-DIC, named DisCIDEr, which evaluates both the accuracy and the distinctiveness of the generated captions. Experimental results demonstrate that TransDIC++ generates distinctive captions and outperforms several state-of-the-art models on the two new benchmarks across different metrics.
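
The abstract names a contrastive learning module and, in the title, contrastive rewards, but does not spell out their formulation here. As a minimal, hedged sketch of the general idea only (not the paper's actual objective), the Python snippet below scores a caption embedding by how much closer it is to its target image than to the reference images, using an InfoNCE-style softmax; every identifier in it (contrastive_reward, caption_emb, target_emb, ref_embs, the 0.07 temperature) is an illustrative assumption.

# Hypothetical sketch, not the authors' implementation: an InfoNCE-style
# score that rewards a caption for being closer to the target image
# embedding than to any reference image embedding.
import torch
import torch.nn.functional as F

def contrastive_reward(caption_emb, target_emb, ref_embs, temperature=0.07):
    """caption_emb: (d,), target_emb: (d,), ref_embs: (k, d)."""
    caption_emb = F.normalize(caption_emb, dim=-1)
    # Stack the target (index 0) and the k reference embeddings as candidates.
    candidates = F.normalize(torch.cat([target_emb.unsqueeze(0), ref_embs]), dim=-1)
    logits = candidates @ caption_emb / temperature  # cosine similarities / T, shape (k + 1,)
    # Log-probability that the caption "retrieves" its own target image;
    # higher when the caption mentions details the references lack.
    return F.log_softmax(logits, dim=0)[0]

# Usage with random embeddings, purely for illustration:
reward = contrastive_reward(torch.randn(512), torch.randn(512), torch.randn(5, 512))

In a self-critical sequence training setup, a score of this kind could serve as the sentence-level reward; treating the target image as the positive and the reference images as negatives is what biases the caption toward details that the references do not share.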

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 12
      December 2024
      721 pages
      EISSN:1551-6865
      DOI:10.1145/3618076

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 November 2024
      Online AM: 24 September 2024
      Accepted: 13 August 2024
      Revised: 03 May 2024
      Received: 18 June 2023
      Published in TOMM Volume 20, Issue 12

      Author Tags

      1. Image Captioning
      2. Distinctiveness
      3. Benchmark
      4. Transformer
      5. Contrastive Learning

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Fundamental Research Funds for the Central Universities
      • HKUST Special
      • HKUST Sports Science and Technology Research

