Research Article | Open Access

Semantic Stealth: Crafting Covert Adversarial Patches for Sentiment Classifiers Using Large Language Models

Published: 22 November 2024

Abstract

Deep learning models have been shown to be vulnerable to adversarial attacks, in which perturbations to their inputs cause the model to produce incorrect predictions. Unlike adversarial attacks in computer vision, where small changes to pixel values can drastically alter a model's output while remaining imperceptible to humans, text-based attacks are difficult to conceal because tokens are discrete. As a result, unconstrained gradient-based attacks often produce adversarial examples that lack semantic meaning, rendering them detectable through visual inspection or perplexity filters. In contrast to methods that rely on gradient-based optimization in the embedding space, we propose an approach that leverages a Large Language Model's ability to generate grammatically correct and semantically meaningful text to craft adversarial patches that blend seamlessly into the original input text. These patches can be used to alter the behavior of a target model, such as a text classifier. Because our approach does not rely on gradient backpropagation, it requires access only to the target model's confidence scores, making it a grey-box attack. We demonstrate the feasibility of our approach using open-source LLMs, including Intel's Neural Chat, Llama 2, and Mistral-Instruct, to generate adversarial patches capable of altering the predictions of a DistilBERT model fine-tuned on the IMDB reviews dataset for sentiment classification.
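To make the grey-box setting concrete, the sketch below illustrates the kind of selection loop the abstract describes: candidate patch sentences are appended to a review, and the attacker keeps the one that most reduces the target classifier's confidence in the original class, using only confidence scores and no gradients. The Hugging Face checkpoint name, the append-at-the-end patch placement, and the hard-coded candidate list (standing in for LLM-generated text) are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a grey-box adversarial-patch selection loop,
# assuming the Hugging Face `transformers` library.
from transformers import pipeline

# Target model: a DistilBERT sentiment classifier. The paper fine-tunes
# DistilBERT on IMDB reviews; this public SST-2 checkpoint is a stand-in.
clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

review = "A beautifully shot film with a career-best lead performance."

# In the paper, candidate patches are produced by an open-source LLM
# (Neural Chat, Llama 2, or Mistral-Instruct) prompted for fluent,
# on-topic text; they are hard-coded here to keep the sketch self-contained.
candidate_patches = [
    "Sadly, the pacing drags so badly that I nearly walked out.",
    "The dull script wastes every ounce of that talent.",
    "Even so, the ending feels cheap and completely unearned.",
]

def positive_score(text: str) -> float:
    """Confidence the target assigns to POSITIVE: the only grey-box signal used."""
    out = clf(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else 1.0 - out["score"]

# Grey-box search: query confidence scores only (no backpropagation) and
# keep the patch that most reduces the original positive-class confidence.
best = min(candidate_patches, key=lambda p: positive_score(review + " " + p))

print(f"original confidence: {positive_score(review):.3f}")
print(f"patched confidence:  {positive_score(review + ' ' + best):.3f}")
print(f"selected patch:      {best}")
```

In the paper's full pipeline the candidate generation itself is driven by an LLM, so fluency and topicality come for free; the scoring loop above is the part that needs only the target model's confidence outputs, which is what makes the attack grey-box rather than white-box.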




      Published In

      AISec '24: Proceedings of the 2024 Workshop on Artificial Intelligence and Security
      November 2024
      225 pages
      ISBN: 9798400712289
      DOI: 10.1145/3689932
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 22 November 2024


      Author Tags

      1. adversarial attack
      2. adversarial patches
      3. large language model
      4. sentiment classification
      5. transformer-based model

      Qualifiers

      • Research-article

      Conference

      CCS '24

      Acceptance Rates

      Overall acceptance rate: 94 of 231 submissions (41%)

