Image Caption with Synchronous Cross-Attention

Published: 23 October 2017


Image captioning aims to translate images into descriptive sentences and therefore draws on both visual and textual resources. Deep Neural Network (DNN) based models are widely applied to this task because of their strong performance in computer vision and natural language processing. In particular, attention mechanisms allow these models to focus on the essential parts of an image. However, previous models ignore both the correlation between the attention at different time steps and the supervision that words provide for attention selection. This paper proposes an Image Caption model with Synchronous Cross-Attention (IC-SCA), which captures a visual sequence of attention informed by the words. The IC-SCA model has two stages, visual and textual, which jointly model the multimodal information to generate descriptions. The model is evaluated on MS-COCO, one of the largest datasets for image captioning. Experimental results on the BLEU-1 to BLEU-4, METEOR, and CIDEr metrics show that IC-SCA outperforms the benchmarks. Attention visualization further verifies the effectiveness of the proposed mechanism.
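As an illustration of the general attention idea mentioned in the abstract, the following is a minimal sketch of generic soft attention over image-region features. It is not the IC-SCA architecture, which the abstract does not specify in detail. The names `soft_attention`, `regions`, and `query` are hypothetical, and the dot-product scoring function is an assumption chosen for simplicity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(regions, query):
    """Weight image-region features by dot-product similarity to a query.

    regions: list of feature vectors, one per image region (hypothetical input).
    query:   vector summarizing the decoder state, for example the previous
             word and previous attention context (an assumption made here
             for illustration, not taken from the paper).
    Returns (weights, context), where context is the weighted sum of regions.
    """
    scores = [sum(r_i * q_i for r_i, q_i in zip(r, query)) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return weights, context


# Toy example: three 2-dimensional region features and one query vector.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = soft_attention(regions, query)
```

In this toy input the first and third regions align with the query equally, so they receive equal weight and the second region receives less. A model that conditions attention on words would build `query` from the textual state at each time step; how IC-SCA does so is not described in the abstract.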





    Thematic Workshops '17: Proceedings of the Thematic Workshops of ACM Multimedia 2017
    October 2017
    558 pages


    Author Tags

    1. convolutional neural network
    2. deep learning
    3. image caption
    4. long short-term memory
    5. multimodal learning


    Funding Sources

    • National Science Fund
    • China 111 Project


    MM '17: ACM Multimedia Conference
    October 23 - 27, 2017
    Mountain View, California, USA


