
Visual-linguistic-stylistic Triple Reward for Cross-lingual Image Captioning

Published: 11 January 2024

  Abstract

    Generating image captions in different languages is worth exploring and is essential for non-native speakers. Nevertheless, collecting paired annotations for every language is time-consuming and impractical, particularly for low-resource languages. To this end, the cross-lingual image captioning task has been proposed: it leverages existing image-source caption annotations together with an unrelated target-language corpus collected in the wild to generate satisfactory captions in the target language. Current methods perform a two-step translation process of image-to-pivot (source) and pivot-to-target. This disjoint two-step process introduces caption quality issues, such as weak semantic alignment between the image and the generated caption and a caption style that does not match the target language. To address these issues, we propose an end-to-end reinforcement learning framework with a Visual-linguistic-stylistic Triple Reward, named TriR. In TriR, we jointly consider visual, linguistic, and stylistic alignments to generate factual, fluent, and natural captions in the target language. Specifically, the image-source caption annotations provide factual semantic guidance, whereas the unrelated target corpus guides the language style of the generated caption. To achieve this, we construct a visual reward module that measures the cross-modal semantic embedding of the image and the target caption, a linguistic reward module that measures the cross-linguistic embedding of the source and target captions, and a stylistic reward module that imitates the presentation style of the target corpus. TriR can be implemented with either the classical CNN-LSTM or the prevalent Transformer architecture. Extensive experiments are conducted with four cross-lingual settings, i.e., Chinese-to-English, English-to-Chinese, English-to-German, and English-to-French. Experimental results demonstrate the remarkable superiority of our method, and ablation experiments validate the beneficial impact of every reward.
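
    The triple reward described above can be read as a weighted sum of three scores that drives policy-gradient caption training. The sketch below, which is not the authors' implementation, illustrates in PyTorch how such a combination could plug into standard self-critical sequence training; the function names, embedding inputs, reward weights, and the stylistic score are assumed placeholders rather than TriR's actual components.

        import torch

        def triple_reward(img_emb, cap_emb_vis, src_emb_lin, cap_emb_lin, style_score,
                          w_vis=1.0, w_lin=1.0, w_sty=1.0):
            # Visual reward: cross-modal cosine similarity between the image and the
            # sampled target caption, both embedded in a shared visual-semantic space.
            r_vis = torch.cosine_similarity(img_emb, cap_emb_vis, dim=-1)
            # Linguistic reward: cross-lingual cosine similarity between the source
            # caption and the sampled target caption in a shared bilingual space.
            r_lin = torch.cosine_similarity(src_emb_lin, cap_emb_lin, dim=-1)
            # Stylistic reward: one scalar per caption (e.g., a discriminator or
            # language-model score of how target-corpus-like the caption reads).
            r_sty = style_score
            return w_vis * r_vis + w_lin * r_lin + w_sty * r_sty

        def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
            # REINFORCE with the greedy decode's reward as baseline (self-critical
            # training): sampled captions that beat the baseline are reinforced.
            advantage = (sample_reward - greedy_reward).detach()
            return -(advantage * sample_log_probs.sum(dim=-1)).mean()

    In this hypothetical sketch the sampled caption is embedded twice, once against the image and once against the source caption, and the three weights would need tuning per language pair; the actual reward modules and their weighting in TriR are defined in the paper itself.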


    Cited By

    • (2024) Cross-Lingual Transfer Learning in NLP: Enhancing English Language Learning for Non-Native Speakers. In 2024 10th International Conference on Communication and Signal Processing (ICCSP), 1042–1047. DOI: 10.1109/ICCSP60870.2024.10544031. Online publication date: 12 April 2024.



      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 4
      April 2024
      676 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3613617
      • Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 January 2024
      Online AM: 28 November 2023
      Accepted: 24 November 2023
      Revised: 21 September 2023
      Received: 30 April 2023
      Published in TOMM Volume 20, Issue 4


      Author Tags

      1. Cross-lingual
      2. image captioning
      3. triple reward
      4. semantic matching

      Qualifiers

      • Research-article

      Funding Sources

      • National Key R&D Program of China
      • National Natural Science Foundation of China
      • Major Project of Anhui Province


