DOI: 10.1609/aaai.v37i1.25137
Research article

Uncertainty-aware image captioning

Published: 07 February 2023

Abstract

It is widely believed that the higher the uncertainty of a word in a caption, the more inter-correlated contextual information is required to determine it. However, current image captioning methods usually generate all words of a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework that, in parallel and iteratively, inserts discontinuous candidate words between existing words, proceeding from easy to difficult until the sentence converges. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark show that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.
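The easy-to-hard insertion decoding described above can be sketched in toy form. Everything below is illustrative and hypothetical: the frequency table stands in for the paper's image-conditioned bag-of-words uncertainty model, and the function names, slot indices, and threshold schedule are invented for the sketch, not taken from the paper's implementation.

```python
import math

# Toy corpus frequencies standing in for the paper's image-conditioned
# bag-of-words model (purely illustrative numbers).
FREQ = {"a": 0.5, "on": 0.3, "dog": 0.05, "runs": 0.01, "grass": 0.005}

def word_uncertainty(word):
    # Rarer words get higher uncertainty here; the real model would also
    # condition on the image.
    return -math.log(FREQ.get(word, 1e-6))

def iterative_insertion_decode(candidates, thresholds):
    """Easy-to-hard parallel insertion: each round commits, in parallel,
    every remaining candidate whose uncertainty is below that round's
    threshold, so low-uncertainty words provide context for harder ones."""
    placed = {}                   # slot index -> committed word
    remaining = dict(candidates)  # slot index -> still-undecided word
    for t in thresholds:
        easy = {s: w for s, w in remaining.items() if word_uncertainty(w) <= t}
        placed.update(easy)
        for s in easy:
            del remaining[s]
        if not remaining:         # converged: nothing left to insert
            break
    return [placed[s] for s in sorted(placed)]

# Function words ("a", "on") have low uncertainty and are committed first;
# content words ("dog", "runs", "grass") appear in later, harder rounds.
caption = iterative_insertion_decode(
    {0: "a", 1: "dog", 2: "runs", 3: "on", 4: "grass"},
    thresholds=[1.5, 3.5, 6.0],
)
print(" ".join(caption))  # a dog runs on grass
```

In the real framework the candidate pool is re-scored by the model after every round and positions are relative insertions between existing words; fixed slots are used here only to keep the sketch short.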



Published In

AAAI'23/IAAI'23/EAAI'23: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence
February 2023
16496 pages
ISBN:978-1-57735-880-0

Sponsors

  • Association for the Advancement of Artificial Intelligence

Publisher

AAAI Press

Qualifiers

  • Research-article
  • Research
  • Refereed limited
