
Learning Transferable Perturbations for Image Captioning

Published: 16 February 2022

Abstract

Recent studies have shown that state-of-the-art deep learning models can be fooled by small but carefully designed perturbations. Existing attack algorithms for the image captioning task are time-consuming, and the adversarial examples they generate transfer poorly to other models. To generate adversarial examples faster and with stronger transferability, we propose to learn the perturbations with a generative model governed by three novel loss functions. In the image domain, an image feature distortion loss maximizes the distance between the encoded features of an original image and its adversarial counterpart. Across the image and caption domains, a local-global mismatching loss separates the encoded representations of the adversarial images from those of the ground-truth captions as far as possible in the common semantic space, from both a local and a global perspective. In the language domain, a language diversity loss makes the captions generated for the adversarial examples differ as much as possible from the correct captions. Extensive experiments show that our generative model efficiently produces adversarial examples that generalize to attack image captioning models trained on unseen large-scale datasets or with different architectures, and even a commercial image captioning service.
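
For concreteness, the three loss terms can be sketched in PyTorch as follows. This is a minimal sketch under assumed tensor shapes and placeholder loss weights, not the paper's exact formulation: the function names, the specific distance and similarity measures, and the assumption of pre-aligned local region-word feature pairs are all illustrative.

import torch
import torch.nn.functional as F

def feature_distortion_loss(feat_clean, feat_adv):
    # Image domain: minimizing this term maximizes the distance between
    # the encoded features of the clean image and its adversarial example.
    # Negative mean-squared error is an assumed distortion measure.
    return -F.mse_loss(feat_adv, feat_clean)

def local_global_mismatch_loss(img_local, cap_local, img_global, cap_global):
    # Cross domain: push the adversarial image embeddings away from the
    # ground-truth caption embeddings in the common semantic space, at
    # both local (aligned region-word pairs) and global (whole image vs.
    # whole sentence) granularity. Minimizing reduces both similarities.
    local_sim = F.cosine_similarity(img_local, cap_local, dim=-1).mean()
    global_sim = F.cosine_similarity(img_global, cap_global, dim=-1).mean()
    return local_sim + global_sim

def language_diversity_loss(adv_logits, gt_tokens):
    # Language domain: negative cross-entropy on the ground-truth caption
    # discourages the captioner from reproducing it for the adversarial image.
    vocab_size = adv_logits.size(-1)
    return -F.cross_entropy(adv_logits.reshape(-1, vocab_size),
                            gt_tokens.reshape(-1))

def total_loss(feat_clean, feat_adv, img_local, cap_local, img_global,
               cap_global, adv_logits, gt_tokens, l1=1.0, l2=1.0, l3=1.0):
    # Assumed overall objective for the perturbation generator: a weighted
    # sum of the three terms (l1, l2, l3 are placeholder weights).
    return (l1 * feature_distortion_loss(feat_clean, feat_adv)
            + l2 * local_global_mismatch_loss(img_local, cap_local,
                                              img_global, cap_global)
            + l3 * language_diversity_loss(adv_logits, gt_tokens))

# Toy usage with random tensors: batch 2, 5 aligned local pairs,
# feature dim 128, caption length 7, vocabulary 1000.
B, K, D, T, V = 2, 5, 128, 7, 1000
loss = total_loss(torch.randn(B, D), torch.randn(B, D),
                  torch.randn(B, K, D), torch.randn(B, K, D),
                  torch.randn(B, D), torch.randn(B, D),
                  torch.randn(B, T, V), torch.randint(0, V, (B, T)))

Minimizing this combined objective trains the generator to produce perturbations that simultaneously distort image features, break cross-modal alignment, and alter the output caption.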




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
    May 2022, 494 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3505207

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 16 February 2022
    Accepted: 01 July 2021
    Revised: 01 June 2021
    Received: 01 January 2021
    Published in TOMM Volume 18, Issue 2


    Author Tags

    1. Adversarial examples
    2. image captioning
    3. robustness of neural networks

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Guangdong International Science and Technology Cooperation Project
    • Guangdong Natural Science Foundation
    • Guangzhou Basic and Applied Research Project
    • CCF-Tencent Open Research fund


    Cited By
    • (2024) Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-23. DOI: 10.1145/3638558. Online publication date: 22-Jan-2024.
    • (2024) Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval. Applied Intelligence 54(23), 12230-12245. DOI: 10.1007/s10489-024-05823-1. Online publication date: 11-Sep-2024.
    • (2023) Zero-shot Scene Graph Generation via Triplet Calibration and Reduction. ACM Transactions on Multimedia Computing, Communications, and Applications 20(1), 1-21. DOI: 10.1145/3604284. Online publication date: 8-Jun-2023.
    • (2023) Complementary Coarse-to-Fine Matching for Video Object Segmentation. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-21. DOI: 10.1145/3596496. Online publication date: 12-Jul-2023.
    • (2023) Semantic Enhanced Video Captioning with Multi-feature Fusion. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-21. DOI: 10.1145/3588572. Online publication date: 20-Mar-2023.
    • (2023) Video Captioning by Learning from Global Sentence and Looking Ahead. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5s), 1-20. DOI: 10.1145/3587252. Online publication date: 7-Jun-2023.
    • (2023) A2SC: Adversarial Attacks on Subspace Clustering. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-23. DOI: 10.1145/3587097. Online publication date: 12-Jul-2023.
    • (2023) Robust Video Stabilization based on Motion Decomposition. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5), 1-24. DOI: 10.1145/3580498. Online publication date: 16-Mar-2023.
    • (2023) NumCap: A Number-controlled Multi-caption Image Captioning Network. ACM Transactions on Multimedia Computing, Communications, and Applications 19(4), 1-24. DOI: 10.1145/3576927. Online publication date: 27-Feb-2023.
    • (2023) LSTAloc: A Driver-Oriented Incentive Mechanism for Mobility-on-Demand Vehicular Crowdsensing Market. IEEE Transactions on Mobile Computing 23(4), 3106-3122. DOI: 10.1109/TMC.2023.3271671. Online publication date: 1-May-2023.
