Abstract
Automatic image captioning is a challenging task that lies at the intersection of computer vision and natural language processing. Although image captioning based on reinforcement learning has made significant progress in recent years, the mismatch between the metric optimized during training and the metrics used at test time remains. Because reinforcement learning optimizes a single metric, the captions generated by the model are monotonous and lack distinctive characteristics, so the model cannot reflect the diversity among images. In response to these problems, we design a novel image captioning model based on lightweight spatial attention and a generative adversarial network. The lightweight spatial attention module discards the coarse-grained approach of applying maximum pooling after convolution and instead transforms the spatial information to preserve key information in the feature map. The game mechanism between the generator and the discriminator is then used to optimize the evaluation metrics of the model. Finally, we design a discriminator network that cooperates with reinforcement learning to update the model parameters and mitigate the inconsistency between the training objective and the test-time language metrics. We verify the effectiveness of the proposed model on the MS-COCO and Flickr30K datasets. The experimental results show that the proposed model achieves state-of-the-art results.
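The abstract describes the two key components only at a high level. As a minimal sketch of how they might be realized, the PyTorch code below shows (i) a spatial attention module that squeezes channels with a 1×1 convolution rather than max pooling, so that spatial detail in the feature map is preserved, and (ii) a self-critical policy-gradient loss that uses discriminator scores as the reward signal. The module names, the 1×1-convolution squeeze, the kernel size, and the exact reward form are illustrative assumptions, not the implementation from the paper.

```python
import torch
import torch.nn as nn


class LightweightSpatialAttention(nn.Module):
    """Sketch of a lightweight spatial attention module.

    Channels are squeezed with a 1x1 convolution (rather than max
    pooling), and a single convolution transforms the resulting
    spatial map into per-location attention weights.
    """

    def __init__(self, in_channels: int, kernel_size: int = 7):
        super().__init__()
        # 1x1 conv compresses channels while keeping spatial detail
        self.squeeze = nn.Conv2d(in_channels, 1, kernel_size=1, bias=False)
        # conv over the spatial map produces attention logits
        self.transform = nn.Conv2d(1, 1, kernel_size,
                                   padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the CNN encoder
        attn = torch.sigmoid(self.transform(self.squeeze(x)))  # (B, 1, H, W)
        return x * attn  # reweight the feature map, emphasizing key regions


def self_critical_gan_loss(log_probs: torch.Tensor,
                           sampled_scores: torch.Tensor,
                           greedy_scores: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss with a discriminator score as the reward.

    log_probs:      (B,) summed log-probabilities of sampled captions
    sampled_scores: (B,) discriminator scores of sampled captions
    greedy_scores:  (B,) discriminator scores of greedily decoded captions,
                    used as the self-critical baseline
    """
    advantage = (sampled_scores - greedy_scores).detach()
    return -(advantage * log_probs).mean()
```

Under this reading, the discriminator is trained to distinguish human captions from generated ones, and its score on sampled captions replaces a hand-crafted metric as the reinforcement-learning reward, which is how the training objective is kept consistent with caption quality at test time.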
Availability of Data and Material
All data and materials support our published claims and comply with field standards.
Code Availability
We are preparing to upload the model code to GitHub.
Funding
This work was supported by the National Natural Science Foundation of China (Nos. 61866014, 61663014, 61966005, 61962017, 617512134) and the Guangxi Natural Science Foundation (Nos. 2018GXNSFDA281019, 2017GXNSFAA198315, 2016GXNSFAA380156, 2018GXNSFDA294011).
Ethics declarations
Ethics approval
This paper strictly abides by the ethical standards of this journal.
Consent to Participate
All authors have reviewed this paper and consented to its submission.
Consent for Publication
Once this paper is accepted, we consent to its publication in this journal.
Conflict of Interest
No conflict of interest exists in the submission of this manuscript, and the manuscript has been approved by all authors for publication.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhou, D., Yang, J. & Bao, R. Collaborative strategy network for spatial attention image captioning. Appl Intell 52, 9017–9032 (2022). https://doi.org/10.1007/s10489-021-02943-w