
SATNet: Captioning with Semantic Alignment and Feature Enhancement

  • Conference paper
Neural Information Processing (ICONIP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13625)


Abstract

The fusion of region and grid features based on location alignment can, to a certain extent, improve the utilization of image features and thus the accuracy of image captioning. However, it still inevitably introduces semantic noise because of spatial misalignment. To address this problem, this paper proposes a novel image captioning model based on semantic alignment and feature enhancement, which contains a Visual Features Adaptive Alignment Module (VFAA) and a Feature Enhancement Module (FEM). At the encoder layer, the VFAA module uses a Visual Semantic Graph (VSG) to generate pure semantic information that more accurately guides the alignment and fusion of the region and grid features, further reducing the semantic noise caused by spatial misalignment. In addition, to ensure that the features that finally enter the decoder layer do not lose their specific attributes, we design the FEM module to fuse in the original region and grid features. To validate the effectiveness of the proposed model, we conduct extensive experiments on the MS-COCO dataset and evaluate it on the online test server. The experimental results show that our model outperforms many state-of-the-art methods.
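
To make the architecture described above concrete, the following PyTorch sketch shows one plausible reading of the two modules: semantic tokens standing in for the Visual Semantic Graph guide cross-attention alignment and fusion of region and grid features (the VFAA idea), and a learned gate re-injects the original features before decoding (the FEM idea). The class and tensor names, dimensions, cross-attention layout, mean-pooling length alignment, and gating are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn


    class VFAASketch(nn.Module):
        """Semantic-alignment sketch: VSG-style semantic tokens guide the
        cross-attention alignment and fusion of region and grid features."""

        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.region_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.grid_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.fuse = nn.Linear(2 * d_model, d_model)

        def forward(self, regions, grids, semantics):
            # regions: (B, Nr, d), grids: (B, Ng, d), semantics: (B, Ns, d) from the VSG.
            r_aligned, _ = self.region_attn(regions, semantics, semantics)
            g_aligned, _ = self.grid_attn(grids, semantics, semantics)
            # Assumed choice: pool the aligned region stream so both streams
            # share the grid length before fusion.
            r_pooled = r_aligned.mean(dim=1, keepdim=True).expand(-1, grids.size(1), -1)
            return self.fuse(torch.cat([r_pooled, g_aligned], dim=-1))


    class FEMSketch(nn.Module):
        """Feature-enhancement sketch: a learned gate blends the fused features
        with the original features so their specific attributes are preserved."""

        def __init__(self, d_model: int = 512):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

        def forward(self, fused, original):
            # fused, original: (B, N, d); g weights each channel of each token.
            g = self.gate(torch.cat([fused, original], dim=-1))
            return g * fused + (1.0 - g) * original


    if __name__ == "__main__":
        B, Nr, Ng, Ns, d = 2, 36, 49, 10, 512
        vfaa, fem = VFAASketch(d), FEMSketch(d)
        fused = vfaa(torch.randn(B, Nr, d), torch.randn(B, Ng, d), torch.randn(B, Ns, d))
        enhanced = fem(fused, torch.randn(B, Ng, d))
        print(enhanced.shape)  # torch.Size([2, 49, 512])

The gating in FEMSketch is one simple way to honor the abstract's requirement that the decoder input not lose the attributes of the original region and grid features; the paper may combine the streams differently.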


Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 62266009, 61866004, 62276073, 61966004, 61962007), the Guangxi Natural Science Foundation (Nos. 2018GXNSFDA281009, 2019GXNSFDA245018, 2018GXNSFDA294001), the Guangxi “Bagui Scholar” Teams for Innovation and Research Project, the Innovation Project of Guangxi Graduate Education (No. JXXYYJSCXXM-2021-013), and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.

Author information


Corresponding author

Correspondence to Canlong Zhang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bai, W., Zhang, C., Li, Z., Wei, P., Wang, Z. (2023). SATNet: Captioning with Semantic Alignment and Feature Enhancement. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_38


  • DOI: https://doi.org/10.1007/978-3-031-30111-7_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30110-0

  • Online ISBN: 978-3-031-30111-7

  • eBook Packages: Computer Science; Computer Science (R0)
