Backdoor Attacks with Input-Unique Triggers in NLP

Zhou, Xukun; Li, Jiwei; Zhang, Tianwei; Lyu, Lingjuan; Yang, Muqiao; He, Jun

doi:10.1007/978-3-031-70341-6_18

Xukun Zhou ORCID: orcid.org/0009-0005-5478-1791¹³,
Jiwei Li¹⁴,
Tianwei Zhang¹⁵,
Lingjuan Lyu¹⁶,
Muqiao Yang ORCID: orcid.org/0000-0001-6273-0138¹⁷ &
…
Jun He ORCID: orcid.org/0000-0003-1511-7554¹³

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 14941))

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

974 Accesses

Abstract

Backdoor attack aims to induce neural models to make incorrect predictions for poison data while keeping predictions on the clean dataset unchanged, which creates a considerable threat to current natural language processing (NLP) systems. Existing backdoor attacking systems face two severe issues: firstly, most backdoor triggers follow a uniform and usually input-independent pattern, e.g., insertion of specific trigger words. This significantly hinders the stealthiness of the attacking model, leading to the trained backdoor model being easily identified as malicious by model probes. Secondly, trigger-inserted poisoned sentences are usually disfluent, ungrammatical, or even change the semantic meaning from the original sentence. To resolve these two issues, we propose a method named NURA, where we generate backdoor triggers unique to inputs. NURA generates context-related triggers by continuing to write the input with a language model like GPT2 [2]. The generated sentence is used as the backdoor trigger. This strategy not only creates input-unique backdoor triggers but also preserves the semantics of the original input, simultaneously resolving the two issues above. Experimental results show that the NURA attack is effective for attack and difficult to defend against: it achieves a high attack success rate across all the widely applied benchmarks while being immune to existing defense methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 139.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Kallima: A Clean-Label Framework for Textual Backdoor Attacks

UTPrompt: Cross-Task Backdoor Prompt Attacks Based on Universal Triggers

Punctuation Matters! Stealthy Backdoor Attack for Language Models

References

Akhtar, N., Mian, A.: Threat of adversarial attacks on deep learning in computer vision: a survey. Ieee Access 6, 14410–14430 (2018)
Article Google Scholar
Brown, T., et al.: Language models are few-shot learners. In: NIPS, vol. 33, pp. 1877–1901 (2020)
Google Scholar
Cer, D., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)
Chen, C., Dai, J.: Mitigating backdoor attacks in LSTM-based text classification systems by backdoor keyword identification. Neurocomputing 452, 253–262 (2021)
Article Google Scholar
Chen, K., et al.: BadPre: task-agnostic backdoor attacks to pre-trained NLP foundation models. arXiv preprint arXiv:2110.02467 (2021)
Chen, X., et al.: BadNL: Backdoor attacks against NLP models with semantic-preserving improvements. In: Annual Computer Security Applications Conference, pp. 554–569 (2021)
Google Scholar
Chen, X., Liu, C., Li, B., Lu, K., Song, D.: Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017)
Cui, G., Yuan, L., He, B., Chen, Y., Liu, Z., Sun, M.: A unified evaluation of textual backdoor learning: Frameworks and benchmarks. NIPS 35, 5009–5023 (2022)
Google Scholar
Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 11, pp. 512–515 (2017)
Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the NAACL, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
Google Scholar
Doan, K., Lao, Y., Zhao, W., Li, P.: LIRA: learnable, imperceptible and robust backdoor attacks. In: ICCV, pp. 11966–11976 (2021)
Google Scholar
Fan, C., et al.: Defending against backdoor attacks in natural language generation. arXiv e-prints, pp. arXiv–2106 (2021)
Google Scholar
Gan, L., Li, J., Zhang, T., Li, X., Meng, Y., Wu, F., Guo, S., Fan, C.: Triggerless backdoor attack for nlp tasks with clean labels. arXiv preprint arXiv:2111.07970 (2021)
Gao, Y., et al.: Design and evaluation of a multi-domain trojan detection method on deep neural networks. IEEE Trans. Dependable Secure Comput. 19(4), 2349–2364 (2021)
Article Google Scholar
Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: ICML, pp. 1243–1252. PMLR (2017)
Google Scholar
Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
Gu, T., Dolan-Gavitt, B., Garg, S.: BadNets: identifying vulnerabilities in the machine learning model supply chain. arXiv e-prints pp. arXiv–1708 (2017)
Google Scholar
Guo, S., Xie, C., Li, J., Lyu, L., Zhang, T.: Threats to pre-trained language models: survey and taxonomy. arXiv preprint arXiv:2202.06862 (2022)
Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)
Jiang, L., Yu, M., Zhou, M., Liu, X., Zhao, T.: Target-dependent twitter sentiment classification. In: ACL, pp. 151–160 (2011)
Google Scholar
Jin, D., Jin, Z., Zhou, J.T., Szolovits, P.: Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 8018–8025 (2020)
Google Scholar
Kim, B., Rudin, C., Shah, J.: The Bayesian case model: a generative approach for case-based reasoning and prototype classification. In: Proceedings of the 27th NIPS, vol. 2, pp. 1952–1960. NIPS 2014, MIT Press, Cambridge, MA, USA (2014)
Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015)
Google Scholar
Kiros, R., et al.: Skip-thought vectors. In: NIPS, vol. 28 (2015)
Google Scholar
Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: ICML, pp. 1885–1894. PMLR (2017)
Google Scholar
Kurita, K., Michel, P., Neubig, G.: Weight poisoning attacks on pretrained models. In: ACL, pp. 2793–2806 (2020)
Google Scholar
Kwon, H.: Friend-guard textfooler attack on text classification system. IEEE Access, 1–1 (2021)
Google Scholar
Li, L., Song, D., Li, X., Zeng, J., Ma, R., Qiu, X.: Backdoor attacks on pre-trained models by layerwise weight poisoning. In: EMNLP, pp. 3023–3032 (2021)
Google Scholar
Li, S., Xue, M., Zhao, B.Z.H., Zhu, H., Zhang, X.: Invisible backdoor attacks on deep neural networks via steganography and regularization. IEEE Trans. Dependable Secure Comput. 18(5), 2088–2105 (2020)
Google Scholar
Li, Y., Li, Y., Wu, B., Li, L., He, R., Lyu, S.: Invisible backdoor attack with sample-specific triggers. In: ICCV, pp. 16463–16472 (2021)
Google Scholar
Liao, C., Zhong, H., Squicciarini, A., Zhu, S., Miller, D.: Backdoor embedding in convolutional neural network models via invisible perturbation. arXiv preprint arXiv:1808.10307 (2018)
Nasar, Z., Jaffry, S.W., Malik, M.K.: Named entity recognition and relation extraction: state-of-the-art. ACM Comput. Surv. (CSUR) 54(1), 1–39 (2021)
Article Google Scholar
Nguyen, T.A., Tran, A.: Input-aware dynamic backdoor attack. In: NIPS, vol. 33, pp. 3454–3464 (2020)
Google Scholar
Nguyen, T.A., Tran, A.T.: WaNet - imperceptible warping-based backdoor attack. In: International Conference on Learning Representations (2021)
Google Scholar
Ning, R., Li, J., Xin, C., Wu, H.: Invisible poison: a blackbox clean label backdoor attack to deep neural networks. In: IEEE INFOCOM 2021-IEEE Conference on Computer Communications, pp. 1–10. IEEE (2021)
Google Scholar
Ohana, B., Tierney, B.: Sentiment classification of reviews using SentiWordNet. In: Proceedings of IT &T, vol. 8 (2009)
Google Scholar
Qi, F., Chen, Y., Li, M., Yao, Y., Liu, Z., Sun, M.: Onion: a simple and effective defense against textual backdoor attacks. arXiv preprint arXiv:2011.10369 (2020)
Qi, F., Chen, Y., Zhang, X., Li, M., Liu, Z., Sun, M.: Mind the style of text! adversarial and backdoor attacks based on text style transfer. In: EMNLP, pp. 4569–4580 (2021)
Google Scholar
Qi, F., et al.: Hidden Killer: invisible textual backdoor attacks with syntactic trigger. In: Proceedings of the 59th ACL, pp. 443–453 (2021)
Google Scholar
Qi, F., Yao, Y., Xu, S., Liu, Z., Sun, M.: Turn the combination lock: Learnable textual backdoor attacks via word substitution. In: Proceedings of the 59th Annual Meeting of ACL, pp. 4873–4883 (2021)
Google Scholar
Qi, X., Xie, T., Pan, R., Zhu, J., Yang, Y., Bu, K.: Towards practical deployment-stage backdoor attack on deep neural networks. In: CVPR (2022)
Google Scholar
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
Google Scholar
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)
MathSciNet Google Scholar
Sarkar, E., Benkraouda, H., Maniatakos, M.: FaceHack: triggering backdoored facial recognition systems using facial characteristics. arXiv preprint arXiv:2006.11623 (2020)
Shao, K., Zhang, Y., Yang, J., Liu, H.: Textual backdoor defense via poisoned sample recognition. Appl. Sci. 11(21) (2021). https://doi.org/10.3390/app11219938
Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: EMNLP2023, pp. 1631–1642 (2013)
Google Scholar
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS, vol. 27 (2014)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: NIPS, vol. 30 (2017)
Google Scholar
Wang, J., et al.: Putting words into the system’s mouth: a targeted attack on neural machine translation using monolingual data poisoning. In: ACL-IJCNLP 2021, pp. 1463–1473 (2021)
Google Scholar
Xiang, Z., Miller, D.J., Chen, S., Li, X., Kesidis, G.: A backdoor attack against 3d point cloud classifiers. In: ICCV, pp. 7597–7607 (2021)
Google Scholar
Yang, W., Lin, Y., Li, P., Zhou, J., Sun, X.: Rap: Robustness-aware perturbations for defending against backdoor attacks on NLP models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8365–8381 (2021)
Google Scholar
Yang, W., Lin, Y., Li, P., Zhou, J., Sun, X.: Rethinking stealthiness of backdoor attack against NLP models. In: ACL, pp. 5543–5557 (2021)
Google Scholar
Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: NIPS, vol. 28 (2015)
Google Scholar
Zhang, Z., Lyu, L., Wang, W., Sun, L., Sun, X.: How to inject backdoors with better consistency: logit anchoring on clean data. In: International Conference on Learning Representations (2021)
Google Scholar

Download references

Acknowledgements

This work was supported by National Natural Science Foundation of China (NSFC) under Grant Nos. 62072459 and 62172421.

Author information

Authors and Affiliations

Renmin University of China, 59 Zhongguancun Street, Haidian District, Beijing, China
Xukun Zhou & Jun He
ShannonAI,Room 3013, Block B, Beifa Building, 15 Xueyuan South Road, Haidian District, Beijing, China
Jiwei Li
Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
Tianwei Zhang
ARK Mori Building 3F, 1-12-32 Akasaka, Minato-ku, Tokyo, Japan
Lingjuan Lyu
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA, 15213, USA
Muqiao Yang

Authors

Xukun Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Jiwei Li
View author publications
You can also search for this author in PubMed Google Scholar
Tianwei Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Lingjuan Lyu
View author publications
You can also search for this author in PubMed Google Scholar
Muqiao Yang
View author publications
You can also search for this author in PubMed Google Scholar
Jun He
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jun He .

Editor information

Editors and Affiliations

LTCI, Télécom Paris, Palaiseau Cedex, France
Albert Bifet
KU Leuven, Leuven, Belgium
Jesse Davis
Faculty of Informatics, Vytautas Magnus University, Akademija, Lithuania
Tomas Krilavičius
Institute of Computer Science, University of Tartu, Tartu, Estonia
Meelis Kull
Department of Computer Science, Bundeswehr University Munich, Munich, Germany
Eirini Ntoutsi
Department of Computer Science, University of Helsinki, Helsinki, Finland
Indrė Žliobaitė

Ethics declarations

Ethical Declarements

Backdoor attacks pose a major risk to natural language processing by subtly manipulating model inferences. While existing defenses examine syntactic correctness and repetition, we propose a fluency-preserving perturbation method, named NURA, to clandestinely poison language models during generation rather than post-hoc. By subtly altering inputs, our approach evades rule-based detection while producing fluent poisoned texts. Through this work, we aim to raise awareness of stealthy input-aware backdoors and spur discussion on mitigation, as adversarial examples integrated during training challenge standard defenses and model auditing. Continued exploration of techniques detecting pattern shifts introduced during poisoning may help safeguard applications, emphasizing proactive consideration of diverse attack vectors throughout development to strengthen protections for real-world language systems.

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhou, X., Li, J., Zhang, T., Lyu, L., Yang, M., He, J. (2024). Backdoor Attacks with Input-Unique Triggers in NLP. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science(), vol 14941. Springer, Cham. https://doi.org/10.1007/978-3-031-70341-6_18

Download citation

DOI: https://doi.org/10.1007/978-3-031-70341-6_18
Published: 22 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70340-9
Online ISBN: 978-3-031-70341-6
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

the ECML PKDD community (opens in a new tab)

Backdoor Attacks with Input-Unique Triggers in NLP