
Learning Transferable Perturbations for Image Captioning

Published: 16 February 2022

Abstract

Recent studies have shown that state-of-the-art deep learning models can be fooled by small but carefully designed perturbations. Existing attack algorithms for the image captioning task are time-consuming, and the adversarial examples they generate transfer poorly to other models. To generate adversarial examples faster and with stronger transferability, we propose to learn the perturbations with a generative model governed by three novel loss functions. In the image domain, an image feature distortion loss maximizes the distance between the encoded features of an original image and its adversarial counterpart. Across the image and caption domains, a local-global mismatching loss separates the encoded representations of the adversarial images from those of the ground-truth captions as far as possible in the common semantic space, from both a local and a global perspective. In the language domain, a language diversity loss makes the captions generated for the adversarial examples differ as much as possible from the correct captions. Extensive experiments show that our generative model efficiently produces adversarial examples that generalize to attack image captioning models trained on unseen large-scale datasets or with different architectures, and even a commercial image captioning service.
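
For concreteness, the three loss terms can be sketched in PyTorch as follows. This is a minimal sketch under assumed tensor shapes and placeholder loss weights, not the paper's exact formulation: the function names, the specific distance and similarity measures, and the assumption of pre-aligned local region-word feature pairs are all illustrative.

import torch
import torch.nn.functional as F

def feature_distortion_loss(feat_clean, feat_adv):
    # Image domain: minimizing this term maximizes the distance between
    # the encoded features of the clean image and its adversarial example.
    # Negative mean-squared error is an assumed distortion measure.
    return -F.mse_loss(feat_adv, feat_clean)

def local_global_mismatch_loss(img_local, cap_local, img_global, cap_global):
    # Cross domain: push the adversarial image embeddings away from the
    # ground-truth caption embeddings in the common semantic space, at
    # both local (aligned region-word pairs) and global (whole image vs.
    # whole sentence) granularity. Minimizing reduces both similarities.
    local_sim = F.cosine_similarity(img_local, cap_local, dim=-1).mean()
    global_sim = F.cosine_similarity(img_global, cap_global, dim=-1).mean()
    return local_sim + global_sim

def language_diversity_loss(adv_logits, gt_tokens):
    # Language domain: negative cross-entropy on the ground-truth caption
    # discourages the captioner from reproducing it for the adversarial image.
    vocab_size = adv_logits.size(-1)
    return -F.cross_entropy(adv_logits.reshape(-1, vocab_size),
                            gt_tokens.reshape(-1))

def total_loss(feat_clean, feat_adv, img_local, cap_local, img_global,
               cap_global, adv_logits, gt_tokens, l1=1.0, l2=1.0, l3=1.0):
    # Assumed overall objective for the perturbation generator: a weighted
    # sum of the three terms (l1, l2, l3 are placeholder weights).
    return (l1 * feature_distortion_loss(feat_clean, feat_adv)
            + l2 * local_global_mismatch_loss(img_local, cap_local,
                                              img_global, cap_global)
            + l3 * language_diversity_loss(adv_logits, gt_tokens))

# Toy usage with random tensors: batch 2, 5 aligned local pairs,
# feature dim 128, caption length 7, vocabulary 1000.
B, K, D, T, V = 2, 5, 128, 7, 1000
loss = total_loss(torch.randn(B, D), torch.randn(B, D),
                  torch.randn(B, K, D), torch.randn(B, K, D),
                  torch.randn(B, D), torch.randn(B, D),
                  torch.randn(B, T, V), torch.randint(0, V, (B, T)))

Minimizing this combined objective trains the generator to produce perturbations that simultaneously distort image features, break cross-modal alignment, and alter the output caption.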




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
    May 2022, 494 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3505207

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 16 February 2022
    Accepted: 01 July 2021
    Revised: 01 June 2021
    Received: 01 January 2021
    Published in TOMM Volume 18, Issue 2


    Author Tags

    1. Adversarial examples
    2. image captioning
    3. robustness of neural networks

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • Guangdong International Science and Technology Cooperation Project
    • Guangdong Natural Science Foundation
    • Guangzhou Basic and Applied Research Project
    • CCF-Tencent Open Research fund


    Cited By
    • (2024) Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20(5), 1-23. DOI: 10.1145/3638558. Online publication date: 22-Jan-2024.
    • (2024) Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval. Applied Intelligence 54(23), 12230-12245. DOI: 10.1007/s10489-024-05823-1. Online publication date: 11-Sep-2024.
    • (2023) Zero-shot Scene Graph Generation via Triplet Calibration and Reduction. ACM Transactions on Multimedia Computing, Communications, and Applications 20(1), 1-21. DOI: 10.1145/3604284. Online publication date: 8-Jun-2023.
    • (2023) Complementary Coarse-to-Fine Matching for Video Object Segmentation. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-21. DOI: 10.1145/3596496. Online publication date: 12-Jul-2023.
    • (2023) Semantic Enhanced Video Captioning with Multi-feature Fusion. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-21. DOI: 10.1145/3588572. Online publication date: 20-Mar-2023.
    • (2023) Video Captioning by Learning from Global Sentence and Looking Ahead. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5s), 1-20. DOI: 10.1145/3587252. Online publication date: 7-Jun-2023.
    • (2023) A2SC: Adversarial Attacks on Subspace Clustering. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-23. DOI: 10.1145/3587097. Online publication date: 12-Jul-2023.
    • (2023) Robust Video Stabilization based on Motion Decomposition. ACM Transactions on Multimedia Computing, Communications, and Applications 19(5), 1-24. DOI: 10.1145/3580498. Online publication date: 16-Mar-2023.
    • (2023) NumCap: A Number-controlled Multi-caption Image Captioning Network. ACM Transactions on Multimedia Computing, Communications, and Applications 19(4), 1-24. DOI: 10.1145/3576927. Online publication date: 27-Feb-2023.
    • (2023) LSTAloc: A Driver-Oriented Incentive Mechanism for Mobility-on-Demand Vehicular Crowdsensing Market. IEEE Transactions on Mobile Computing 23(4), 3106-3122. DOI: 10.1109/TMC.2023.3271671. Online publication date: 1-May-2023.
