
Mellivora Capensis: A Backdoor-Free Training Framework on the Poisoned Dataset without Auxiliary Data

Yuwen Pu, Jiahao Chen, Chunyi Zhou, Zhou Feng, Qingming Li, Chunqiang Hu, Shouling Ji

Yuwen Pu, Jiahao Chen, Chunyi Zhou, Zhou Feng, Qingming Li and Shouling Ji are with the College of Computer Science and Technology at Zhejiang University, Hangzhou, Zhejiang, 310027, China. E-mail: {yw.pu, xaddwell, zhouchunyi, zhou.feng, liqm, sji}@zju.edu.cn. Chunqiang Hu is with the School of Big Data & Software Engineering, Chongqing University, Chongqing, 400030, China. E-mail: chu@cqu.edu.cn. Yuwen Pu and Jiahao Chen are the co-first authors.

Abstract

The efficacy of deep learning models is profoundly influenced by the quality of their training data. Given the demands of data diversity, data scale, and annotation cost, model trainers frequently source datasets from online repositories. Although economical, this practice exposes models to substantial security risks. Untrusted parties can covertly embed triggers in a dataset, so that a model trained on the poisoned dataset can be hijacked through a backdoor attack. Despite the proliferation of countermeasures, inherent limitations constrain their effectiveness in practice: they require substantial quantities of clean samples, their defense performance is inconsistent across attack scenarios, and they lack resilience against adaptive attacks. In this paper, we therefore address the challenges of backdoor defense in real-world scenarios, with the aim of securing the training paradigm when data are collected from untrusted sources. Concretely, we first explore the inherent relationship between potential perturbations and the backdoor trigger, and demonstrate, through theoretical analysis and experiments, the key observation that poisoned samples are more robust to perturbation than clean ones. Based on this observation, we propose a robust and clean-data-free backdoor defense framework, Mellivora Capensis (MeCa), which enables the model trainer to train a clean model on the poisoned dataset. MeCa effectively identifies poisoned samples and requires neither clean samples nor knowledge about the poisoned dataset (e.g., the poisoning ratio).
We conduct extensive experiments defending against 8 SOTA attacks (including 3 adaptive attacks) on 4 datasets (Imagenette, Tiny ImageNet, CIFAR10, and CIFAR100). The experimental results reveal that MeCa achieves an average ASR of almost 0.00% against SOTA backdoor attacks while maintaining model availability, outperforming 7 SOTA backdoor defense methods. Furthermore, its strong performance on 3 different model architectures and poisoning ratios highlights the generalization capability of MeCa.

Index Terms:
backdoor attack, backdoor defense, adversarial perturbation, model security.

1 Introduction

Training deep learning models relies on a large amount of training data [1, 2, 3, 4]. However, training data are usually collected from the web or purchased from an untrustworthy third party, which creates a risk that the data are poisoned or unreliable [5, 6]. For example, a company that wants to train a facial recognition model may obtain poisoned data from a third party who aims to hijack the model for profit [7, 8], as depicted in Fig.1. An unreliable training dataset may therefore bring serious security threats to the model trainer and may breach regulations such as the Artificial Intelligence Act (AIA) [9] and the Digital Services Act (DSA) [10]. One typical threat is the backdoor attack, which is usually conducted by poisoning the training data [11, 12]. In a backdoor attack, the attacker manipulates a few training samples by adding a specific trigger (e.g., a pixel patch or blending) and changes the labels of these samples to a target class. The backdoored model maintains high accuracy on normal samples but predicts the target class whenever the trigger pattern is attached to a test sample, as depicted in Fig.1. Backdoor attacks pose a significant security risk for model owners because they are easy to conduct and allow adversaries to control the trained model stealthily [13, 14]. For example, a backdoored traffic sign recognition model may misclassify a stop sign carrying the trigger patch as a no-tooting sign, which may cause a serious traffic accident. Untrusted dataset sources thus often serve as a breeding ground for backdoor attacks, posing a significant security threat to model trainers.

Regarding third-party poisoned training datasets, both academia and industry have conducted extensive research, for example on the detection and purification of backdoor samples in poisoned datasets [15, 16]. Although effective backdoor detection and purification are valuable, it is clearly more cost-effective and practical to train a clean model directly on a poisoned dataset, if that is possible. Such a paradigm simplifies data preparation and reduces cleaning costs by enabling training on potentially tainted datasets, fostering robust models without extensive preprocessing. It is crucial in resource-intensive data scenarios and could redefine data handling across fields, ensuring reliable, cost-effective outcomes [17]; in other words, we would be able to rebuild amidst the ruins.

Figure 1: Security threats on training the poisoned data.

Consequently, another research direction in backdoor defense is to train clean models on poisoned datasets [18, 19, 20, 21, 22]. However, the existing methods have non-negligible limitations in practical applications. (1) Requirement for clean samples. Their defense effectiveness depends heavily on obtaining a substantial number of clean samples [20, 21, 22], which is impractical for defenders in certain real-world scenarios [23]. (2) Unstable defense performance. Their actual defense performance varies significantly with the backdoor attack method and the poisoning ratio [24, 19], which are usually unknown to defenders in advance. (3) Inadequate defense against adaptive attacks. They may exhibit poor resilience against adaptive attacks [19], e.g., attacks that transform some poisoned samples.

Therefore, in this context, we ask the following research questions:

Is it feasible to train a clean model directly on a poisoned dataset, without any knowledge of the attacker’s capabilities (such as poisoning ratio) and without any auxiliary clean dataset? If so, how to ensure model availability?

In this paper, we intend to overcome the shortcomings mentioned above and propose a more practical and robust backdoor defense approach. To achieve these goals, there are three challenges that we need to address:

  • How to identify the poisoned samples regardless of the poisoning ratio and without the auxiliary clean dataset?

  • How to defend against various backdoor attacks and have a stable performance against adaptive attacks?

  • How to achieve satisfactory defense performance while maintaining a high accuracy of the main task?

To address the above challenges, we propose the Mellivora Capensis (MeCa) framework, a robust and clean-data-free backdoor defense scheme based on adversarial perturbation, which enables the defender to train a clean model on a poisoned dataset. MeCa follows a three-step approach. First, we examine the fundamental relationship between perturbations and backdoors through theoretical analysis and experimental validation. Then, based on the resulting key observation, we use adversarial perturbations to distinguish clean samples from backdoor samples at a coarse granularity. By unlearning the identified clean samples and re-training on the remaining poisoned dataset, we obtain an enhanced backdoor model that precisely distinguishes between poisoned and clean samples. Finally, we train a clean model on the identified clean samples and enhance its performance through fine-tuning, incorporating relabeled poisoned samples into the clean dataset. The main contributions are threefold:

  • We explore the intrinsic differences between poisoned samples and clean samples, uncovering a key observation about how differently they respond to perturbations. We conduct extensive theoretical analysis and experiments to substantiate our findings.

  • We propose a robust and clean-data-free backdoor defense framework, namely Mellivora Capensis (MeCa). MeCa allows defenders to train a clean model directly on poisoned data that maintains high accuracy on the primary task, while also being capable of resisting a variety of popular backdoor attacks and adaptive attacks. MeCa does not require any auxiliary clean data, offering a novel paradigm for backdoor defense and robust training.

  • We evaluate MeCa through extensive experiments on four datasets, which demonstrate a better backdoor defense performance compared to seven SOTA backdoor defense methods. Experimental results validate that MeCa could decrease the attack success rate (average ASR nearly 0.00%) while only incurring negligible accuracy loss. Furthermore, MeCa also demonstrates robust generalization across various model architectures and poisoning ratios.

2 Related Works

In this section, we review the relevant work on backdoor attacks and introduce the development of backdoor defense.

2.1 Backdoor Attack Methods

In recent years, many backdoor attack approaches have been proposed, including patch-based attacks, attacks that differ in trigger visibility, label-consistent attacks, and so on. Gu et al.[12] proposed a typical backdoor attack method that adds a specific patch to the samples and changes the corresponding label to the target label. This attack achieves a high attack success rate and has a low impact on the performance of the main task. Chen et al.[25] proposed a backdoor attack in which the trigger is a blended background rather than a single pixel, which makes the trigger hard for humans to notice. Liu et al.[26] proposed a stealthier backdoor attack that plants reflections as a trigger into the victim model. This attack is resistant to many existing defense methods. Liu et al.[27] proposed a lightweight backdoor attack that only needs to invert the neural network to generate a general trojan trigger and fine-tune some layers of the network to implant the trigger. However, all the above backdoor attacks require poisoning the labels, which makes them easier to detect. Therefore, some researchers have also proposed clean-label backdoor attacks. For example, Shafahi et al.[28] proposed a targeted clean-label poisoning attack. This attack crafts poison images that collide with a target image in feature space, making them indistinguishable to the network. Because the attacker does not need to control the labels, the attack is stealthier. Turner et al.[29] proposed a clean-label backdoor attack based on adversarial examples and GAN-generated data. The key feature of this attack is that the poisoned samples appear consistent with their labels and thus seem clean even under human inspection. Chen et al.[8] proposed an Invisible Poisoning Attack (IPA), which is difficult to detect with existing defense methods.
This attack not only employs highly stealthy poison training examples with the clean labels (perceptually similar to their clean samples), but also does not need to modify the labels. Li et al.[30] proposed two stealthy backdoor attacks in which the triggers are derived from the covert features. Compared with the existing backdoor attacks, the trigger patterns of this attack method are invisible to human eyes. Moreover, it is difficult to recover the backdoor trigger through the optimization algorithm. Zhu et al.[31] proposed a transferable clean-label poisoning attack in which poison samples are fabricated to surround the targeted sample in feature space. Saha et al.[32] proposed a novel backdoor attack in which the poisoned samples are similar to the clean samples with the correct labels. [33] and [34] proposed defense-resistant backdoor attacks in an outsourced cloud environment.

2.2 Backdoor Defense Methods

Many defense methods are proposed to resist the existing backdoor attacks. Some approaches require that the defender must own some reserved clean datasets. For example, Qi et al.[20] proposed a backdoor sample detection method that directly enforces and magnifies distinctive characteristics of the post-attacked model to facilitate poison detection. Guo et al.[21] proposed a universal detection approach based on clustering and centroids analysis. The approach can detect the poisoned samples based on density-based clustering and the clean validation dataset. Zhu et al.[35] proposed a defense method by inserting a learnable neural polarizer layer, which is optimized based on a limited clean dataset. The layer can purify the poisoned sample by filtering trigger information while maintaining clean information. For a backdoored model, Chen et al.[22] proposed a generic scheme for defending against backdoor attacks. The insight of this scheme is to localize the neuron set related to the trigger with the auxiliary clean dataset and suppress the compromised neurons. Ma et al.[36] proposed an input-level detection method. The intuition of this method is that even though a poisoned sample and a clean sample are classified into the target label, their intermediate representations are also different. Based on this observation, the poisoned samples can be detected easily. Wei et al.[37] presented a backdoor mitigation method using a small clean dataset. This method employs unlearning shared adversarial examples to purify the backdoored model. Researchers also proposed some backdoor defense methods without auxiliary clean datasets. It is a more strict and practical setting. For instance, Li et al.[18] proposed a defense approach that aims to train the clean model on the poisoned data. The main intuition of this method is that the models learn backdoor samples much faster than learning with clean samples. 
The backdoor examples can be easily removed by filtering out the low-loss examples at an early stage. Because the poisoned samples are much more sensitive to transformations than the clean samples in a backdoored model, Chen et al.[19] distinguish the poisoned samples from clean samples based on the feature consistency towards transformations. Weng et al.[38] found a trade-off between adversarial and backdoor robustness. Then, Gao et al.[39] challenged the trade-off between adversarial and backdoor robustness and proposed a backdoor defense strategy based on adversarial training regardless of the trigger pattern. Li et al.[40] found that the existing backdoor attacks have non-transferability. That is, the trigger sample is not effective in another model that has not been injected with the same trigger. Based on this observation, the authors proposed an input sample detection method by comparing the input sample and the samples picked from its predicted class label based on a feature extractor. Huang et al.[41] proposed a backdoor defense via decoupling the training process, thereby breaking the connection between the trigger and target label. Mu et al.[42] observed that the adversarial examples have similar behaviors as the triggered samples. Then, a progressive backdoor erasing method is proposed to purify the poisoned model via employing untargeted adversarial attacks. Tang et al.[43] presented a robust backdoor detection approach that can effectively detect data contamination attacks. Feng et al.[44] proposed a backdoor detection method for pre-trained encoders, requiring neither classifier headers nor input labels. Chen et al.[45] presented a robust backdoor defense scheme for federated learning. This scheme can overcome many backdoor attacks, including amplified magnitude sparsification, adaptive clipping, and so on.

Although many backdoor defense methods have been proposed, they have limitations in practical scenarios (e.g., requiring an auxiliary clean dataset). Unlike existing backdoor defense methods, we first explore the relationship between perturbations and the backdoor. Then, based on the exploration results, we propose a robust backdoor defense method that does not require an auxiliary clean dataset and performs stably against various backdoor attacks across different models and poisoning ratios.

3 Preliminaries and Threat Model

In this section, we introduce the main related techniques and the threat model of this paper.

3.1 Backdoor Attack

A backdoor attack injects a trigger into a model by poisoning the training dataset. During inference, the backdoored model performs well on the original task but outputs attacker-chosen responses when the input contains the trigger. For clarity, we formalize the most common backdoor attack method, BadNets [12], as follows:

Let $f:\mathcal{X}\rightarrow\mathcal{Y}$ be a neural network for an image classification task. Let $\mathcal{D}=\left\{\left(\boldsymbol{x}_{i},y_{i}\right)\right\}_{i=1}^{N}$ be a clean training set, where $\boldsymbol{x}_{i}\in\mathcal{X}$ and $y_{i}\in\mathcal{Y}$ are the training image and the corresponding label, respectively. To conduct a backdoor attack, the attacker chooses a backdoor trigger $\boldsymbol{b}=(\boldsymbol{m},\boldsymbol{t})$ that consists of a blending mask $\boldsymbol{m}$ and a trigger pattern $\boldsymbol{t}$. In general, the trigger pattern is small in order to make the backdoor attack stealthier. During the training process, the attacker randomly selects some training samples and poisons them by adding the backdoor trigger. For one poisoned training sample:

$$\boldsymbol{x}^{\prime}=(\mathbf{1}-\boldsymbol{m})\times\boldsymbol{x}+\boldsymbol{m}\times\boldsymbol{t}. \tag{1}$$

where $\times$ is pixel-wise multiplication and $\boldsymbol{x}^{\prime}$ is the poisoned training sample. After modifying the training sample, the attacker sets its label to the target label. A successful backdoored model maintains high accuracy on the main task while outputting the target label whenever the trigger $\boldsymbol{t}$ appears. After BadNets [12], many attack variants have been proposed to enhance effectiveness and stealthiness. For example, Blend [25] and PhysicalBA [46] design more complex patterns. WaNet [47] is a stealthier attack with input-specific triggers. SIG [48] is a clean-label attack that is stealthier because it does not change the labels. To evade existing defense methods, adaptive attacks (e.g., TaCT [43], AdaptiveBlend [49] and AdaptivePatch [49]) have also been proposed to improve attack performance.
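As a rough illustration of Eq. (1), the following NumPy sketch blends a trigger into a random fraction of a toy dataset and relabels those samples. The function names, the 2x2 white corner patch, the image size, and the 5% ratio are assumptions made for this example only and are not taken from any specific attack implementation.

```python
import numpy as np

def apply_trigger(x, mask, trigger):
    """Blend a trigger into images following Eq. (1): x' = (1 - m) * x + m * t."""
    return (1.0 - mask) * x + mask * trigger

def poison_dataset(images, labels, mask, trigger, target_label, ratio, seed=0):
    """Poison a random fraction of the samples and relabel them to the target class."""
    rng = np.random.default_rng(seed)
    n_poison = int(round(ratio * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, labels = images.copy(), labels.copy()
    images[idx] = apply_trigger(images[idx], mask, trigger)
    labels[idx] = target_label
    return images, labels, idx

# Toy data (assumption): 100 RGB images of 8x8 pixels with values in [0, 1].
rng = np.random.default_rng(42)
images = rng.random((100, 3, 8, 8))
labels = rng.integers(0, 10, size=100)

# Hypothetical trigger: a white 2x2 patch in the bottom-right corner.
mask = np.zeros((3, 8, 8))
mask[:, -2:, -2:] = 1.0
trigger = np.ones((3, 8, 8))

poisoned_images, poisoned_labels, poisoned_idx = poison_dataset(
    images, labels, mask, trigger, target_label=0, ratio=0.05
)
```

With a binary mask, pixels outside the patch are left untouched; a fractional mask would instead give a Blend-style trigger.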

Figure 2: Threat Model of MeCa.

3.2 Threat Model

In the scenario considered in MeCa, the model trainer (Defender) intends to train a model on training data obtained from a third party. A malicious data provider (Attacker) manipulates the dataset in order to induce the defender to train a model with a backdoor, thereby enabling subsequent hijacking of the model. The threat model is depicted in Fig.2. In this paper, the defender uses MeCa to train a clean model on the poisoned dataset, free from the attacker’s backdoor triggers. Considering the deployment and development of MeCa in real-world scenarios, we describe the goals, knowledge, and capabilities of both the Defender and the Attacker below.

Defender’s goal: The defender aims to identify the poisoned samples and train a clean model on the poisoned dataset.

Defender’s knowledge and capabilities: The defender defines the model architecture, learning algorithm, and hyperparameters, and trains the model. The defender has no knowledge of the poisoning ratio and no auxiliary clean dataset.

Attacker’s goal: The attacker aims to inject a backdoor into the trained model by poisoning the training dataset.

Attacker’s knowledge and capabilities: The attacker can manipulate the training dataset, including poisoning a fraction of the training dataset and modifying the labels of the poisoned samples. However, the attacker cannot access the model architecture or parameters, and cannot interfere with the model training process.

Figure 3: Framework of MeCa.

4 Design of MeCa

In this section, we provide an overview of the proposed method and the detailed design.

4.1 Overview of MeCa

In this paper, we focus on how to identify the poisoned samples and train a clean model on the poisoned dataset. This goal is not trivial, since even a tiny number of poisoned samples can accomplish the backdoor injection. To address this problem, we first explore the inherent relationship between perturbations and the backdoor attack, both theoretically and experimentally. Based on the results of this investigation, we propose a three-stage backdoor defense method dubbed MeCa; the overall framework is shown in Fig.3. Firstly, we select a small number of clean samples by using the difference between clean and poisoned samples when subjected to perturbations. An enhanced backdoor model is trained by unlearning the chosen clean samples and relearning the remaining dataset. Then, we identify the poisoned and clean samples with high precision by leveraging the enhanced backdoor model. Specifically, samples that are misclassified by the enhanced backdoor model are deemed clean, while the remaining samples are classified as poisoned. Subsequently, a clean model can be trained on the identified clean samples. Finally, we relabel the poisoned samples and merge them with the clean ones to obtain a thoroughly clean dataset, which is used to fine-tune the clean model for enhanced performance. The design of MeCa is presented in detail as follows.
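The identification rule in the second phase can be sketched in a few lines. This is only an illustration of the stated rule (misclassified by the enhanced backdoor model means clean, otherwise poisoned); the function name and the toy predictions are assumptions, and training the enhanced backdoor model itself is not shown.

```python
import numpy as np

def split_by_enhanced_model(predictions, labels):
    """Return (clean_idx, poisoned_idx) given the enhanced backdoor model's predictions.

    Samples whose prediction disagrees with the dataset label are deemed clean;
    samples the model still fits are deemed poisoned.
    """
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    misclassified = predictions != labels
    return np.flatnonzero(misclassified), np.flatnonzero(~misclassified)

# Toy example (assumption): the last two samples carry the target label 0
# and are still predicted as 0 by the enhanced backdoor model.
labels = np.array([0, 1, 2, 0, 0])
enhanced_preds = np.array([3, 4, 1, 0, 0])
clean_idx, poisoned_idx = split_by_enhanced_model(enhanced_preds, labels)
```

The rule relies on the enhanced model having unlearned the clean samples, so that it remains accurate mainly on samples carrying the trigger.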

4.2 Detailed Design of MeCa

In this section, we present the detailed design of our backdoor defense approach.

4.2.1 Exploring the Relationship between Perturbation and Backdoor

To accurately identify poisoned samples, we need to enhance the backdoor model by unlearning certain clean samples. It is crucial to select a subset of clean samples without relying on an auxiliary dataset and to understand how poisoned samples affect the predictions made by backdoor models. Based on this understanding, we can design metrics to distinguish between clean and poisoned samples.

Our investigation begins with a review of the backdoor attack process, similar to that in [50]. Let $\mathcal{D}=\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{N}$ represent a clean training set, and let $C:\mathcal{X}\rightarrow\mathcal{Y}$ denote the functionality of the target neural network. Each image $\boldsymbol{x}_{i}$ in $\mathcal{D}$ satisfies $\boldsymbol{x}_{i}\in\mathcal{X}=[0,1]^{C\times W\times H}$, and its corresponding label $y_{i}\in\mathcal{Y}=\{1,\ldots,K\}$, where $K$ is the total number of classes. To initiate an attack, adversaries poison a selection of clean samples $\mathcal{D}_{p}$ using a covert transformation $T(\cdot)$. These poisoned samples are then combined with the clean dataset prior to training the compromised model. This process can be formalized as $\mathcal{D}_{t}=\mathcal{D}\cup\mathcal{D}_{p}$, where $\mathcal{D}_{p}=\{(x_{i}^{\prime},y_{t})\,|\,x_{i}^{\prime}=T(x_{i}),(x_{i},y_{i})\in\mathcal{D}_{p}\}$. Various methods have been devised to make $T(\cdot)$ more subtle and harder to detect, or to reduce the assumptions about the adversary’s capabilities, such as the poison ratio $pr=|\mathcal{D}_{p}|/|\mathcal{D}|$. Despite these variations, the ultimate goal remains the same [50]: manipulating the neural network by training a malicious model:

$$\min_{\boldsymbol{\theta}}\sum_{i=1}^{N_{b}}\mathcal{L}\left(f\left(\boldsymbol{x}_{i};\boldsymbol{\theta}\right),y_{i}\right)+\sum_{j=1}^{N_{p}}\mathcal{L}\left(f\left(\boldsymbol{x}_{j}^{\prime};\boldsymbol{\theta}\right),y_{t}\right). \tag{2}$$

where $N_{b}=|\mathcal{D}|$ and $N_{p}=|\mathcal{D}_{p}|$.

The trained malicious models behave abnormally on poisoned samples while performing normally on clean samples. The cause of this abnormal behavior is the cornerstone for designing a defense against backdoor attacks. Khaddaj et al.[51] showed that backdoor attacks correspond to the strongest feature in the training data. Guo et al.[50] found that the predictions of poisoned samples are significantly more consistent than those of clean ones when all pixel values are amplified. These observations indicate that the characteristics of backdoor samples and clean samples differ, and that it is possible to identify the backdoor samples. Inspired by these previous works [52, 50, 6], we ask whether poisoned samples can be identified by exploiting their robustness. To verify this, we conduct experiments with the BadNets [12] and Blend [25] attacks on CIFAR10 with ResNet18. The poisoning ratio of these attacks is set to 0.05, which yields a high attack success rate (ASR $\geq 99\%$). To measure the robustness of a sample, we adopt the metric below:

$$\mathcal{L}_{kl}\big(f(x),\,f(x\times(1-\hat{m})+\hat{m}\times\delta)\big). \tag{3}$$

where $\delta=\max\left(\min\left(\frac{\nabla_{\delta}\ell(f_{\theta}(x),y)}{\|\nabla_{\delta}\ell(f_{\theta}(x),y)\|_{2}},\epsilon\right),-\epsilon\right)$ and $\hat{m}$ is a randomly generated mask of the perturbation. As shown in Fig.4, on both BadNets and Blend attacks, most poisoned samples exhibit good consistency against adversarially generated perturbations $\delta$.

(a) BadNets Attack
(b) Blend Attack
Figure 4: Statistical histogram of training samples (with 5% poison ratio on CIFAR10 using ResNet18).
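The metric in Eq. (3) can be sketched as follows. To keep the example self-contained, a linear softmax classifier stands in for the ResNet18 used in the experiments, so the input gradient has a closed form; the KL direction, the Bernoulli mask, and the values of `eps` and `mask_ratio` are assumptions of this sketch rather than settings reported above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(W, b, x):
    """Class probabilities of a linear softmax model (stand-in for f)."""
    return softmax(W @ x.ravel() + b)

def input_gradient(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input for the linear model."""
    p = predict(W, b, x)
    p[y] -= 1.0  # dL/dlogits = p - onehot(y)
    return (W.T @ p).reshape(x.shape)

def robustness_score(W, b, x, y, eps, mask_ratio, rng):
    """KL(f(x) || f(x * (1 - m) + m * delta)) as in Eq. (3); lower means more consistent."""
    g = input_gradient(W, b, x, y)
    # delta: L2-normalised gradient, clipped element-wise to [-eps, eps].
    delta = np.clip(g / (np.linalg.norm(g) + 1e-12), -eps, eps)
    # Random binary mask m_hat selecting where the perturbation is applied.
    m_hat = (rng.random(x.shape) < mask_ratio).astype(x.dtype)
    x_pert = x * (1.0 - m_hat) + m_hat * delta
    p = predict(W, b, x)
    q = predict(W, b, x_pert)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

# Toy model and input (assumptions): 10 classes, one 3x8x8 image.
rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(10, 3 * 8 * 8))
b = np.zeros(10)
x = rng.random((3, 8, 8))
y = 3
score = robustness_score(W, b, x, y, eps=0.1, mask_ratio=0.3, rng=rng)
```

Under the observation above, poisoned samples would be expected to yield smaller scores than clean ones on a backdoored model, which is what the histograms in Fig.4 summarize.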

To further explain this intriguing phenomenon, we provide a proof, building on recent studies [50], that analyzes the characteristics of poisoned samples. We start with the regression solution for the Neural Tangent Kernel (NTK) model as presented in [53, 50, 54]:

$$\phi_{t}(\cdot)=\frac{\sum_{i=1}^{N_{b}}Q(\cdot,\boldsymbol{x}_{i})\cdot y_{i}+\sum_{i=1}^{N_{p}}Q(\cdot,\boldsymbol{x}_{i}^{\prime})\cdot y_{t}}{\sum_{i=1}^{N_{b}}Q(\cdot,\boldsymbol{x}_{i})+\sum_{i=1}^{N_{p}}Q(\cdot,\boldsymbol{x}_{i}^{\prime})} \tag{4}$$

Here:

  • $\phi_{t}(\cdot)$ stands for the predictive probability output of the function $f(\cdot;\theta)$ for class $t$,

  • $\boldsymbol{x}_{i}$ are clean training samples,

  • $y_{i}$ are the corresponding one-hot labels (so $y_{i}=1$ if $\boldsymbol{x}_{i}$ belongs to class $t$, and $y_{i}=0$ otherwise),

  • $\boldsymbol{x}_{i}^{\prime}$ are the poisoned samples,

  • $Q(\boldsymbol{x},\boldsymbol{z})=e^{-2\gamma\|\boldsymbol{x}-\boldsymbol{z}\|^{2}}$ is the kernel function with $\gamma>0$,

  • $N_{b}$ and $N_{p}$ denote the number of clean and poisoned samples, respectively.

Assume the target label $y_{t}=1$ and all other labels are 0. Because the training samples are uniformly distributed over the $k$ classes, $\frac{N_{b}}{k}$ clean samples are labeled as $y_{t}$. This allows us to simplify the equation to:

$$\phi_{t}(\cdot)=\frac{\sum_{i=1}^{N_{b}/k}Q(\cdot,\boldsymbol{x}_{i})+\sum_{i=1}^{N_{p}}Q(\cdot,\boldsymbol{x}_{i}^{\prime})}{\sum_{i=1}^{N_{b}}Q(\cdot,\boldsymbol{x}_{i})+\sum_{i=1}^{N_{p}}Q(\cdot,\boldsymbol{x}_{i}^{\prime})} \tag{5}$$

Given that the backdoor sample $\boldsymbol{x}^{\prime}$ generally does not belong to the target class $y_{t}$, the contribution of clean samples in the numerator is negligible compared to that of the poisoned samples. Because every kernel value is non-negative, dropping the clean-sample term from the numerator gives a lower bound on $\phi_{t}(\cdot)$ that depends only on the poisoned samples:

$$\phi_{t}(\cdot)\geq\frac{\sum_{i=1}^{N_{p}}Q(\cdot,\boldsymbol{x}_{i}^{\prime})}{\sum_{i=1}^{N_{b}}Q(\cdot,\boldsymbol{x}_{i})+\sum_{i=1}^{N_{p}}Q(\cdot,\boldsymbol{x}_{i}^{\prime})} \tag{6}$$

Consider a specific backdoor sample $\boldsymbol{x}^{\prime}$, which is crafted as a mixture of a clean sample $\boldsymbol{x}$ and a target pattern $\boldsymbol{t}$, i.e., $\boldsymbol{x}^{\prime}=(1-m)\odot\boldsymbol{x}+m\odot\boldsymbol{t}$, where $\odot$ denotes element-wise multiplication, and $m$ controls the blending ratio between $\boldsymbol{x}$ and $\boldsymbol{t}$.
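The blending rule above can be illustrated with a minimal Python sketch. The helper name `blend_trigger`, the flat-list image representation, and all pixel values are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
def blend_trigger(x, t, m):
    """Return x' = (1 - m) * x + m * t, element-wise, for flat lists of floats.

    x: clean sample, t: trigger pattern, m: per-element mask / blending ratio.
    """
    if not (len(x) == len(t) == len(m)):
        raise ValueError("x, t and m must have the same length")
    return [(1.0 - mi) * xi + mi * ti for xi, ti, mi in zip(x, t, m)]


# Toy 4-pixel "image" and an all-white trigger (made-up values).
x = [0.2, 0.4, 0.6, 0.8]
t = [1.0, 1.0, 1.0, 1.0]

# A binary mask stamps the trigger onto the last two pixels (patch-style).
patch = blend_trigger(x, t, [0.0, 0.0, 1.0, 1.0])

# A constant mask of 0.2 mixes the trigger into every pixel (blend-style).
blended = blend_trigger(x, t, [0.2] * 4)
```

With a binary mask the trigger overwrites the masked pixels and leaves the rest untouched, while a fractional mask spreads a faint copy of the trigger over the whole input.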

When $N_{p}$ approaches $N_{b}$ (i.e., the poisoning ratio is close to 50%), the attack’s efficacy is maximized. In this case, evaluating the bound at a perturbed backdoor sample $\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime}$ gives:

$$\phi_{t}(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime})\geq\frac{\sum_{i=1}^{N_{p}}Q\left(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime},\boldsymbol{x}_{i}^{\prime}\right)}{\sum_{i=1}^{N_{b}}Q\left(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime},\boldsymbol{x}_{i}\right)+\sum_{i=1}^{N_{p}}Q\left(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime},\boldsymbol{x}_{i}^{\prime}\right)} \tag{7}$$

To obtain the inequality below:

$$\phi_{t}(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime})\geq\frac{1}{2} \tag{8}$$

it suffices to have:

$$\sum_{i=1}^{N_{b}}Q\left(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime},\boldsymbol{x}_{i}\right)\leq\sum_{i=1}^{N_{p}}Q\left(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime},\boldsymbol{x}_{i}^{\prime}\right) \tag{9}$$
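For completeness, the step linking (7), (8), and (9) can be spelled out. The shorthand $B$ and $P$ for the clean and poisoned kernel sums is introduced here for readability and does not appear in the original derivation. Since both sums are positive, the right-hand side of (7) is at least $\frac{1}{2}$ exactly when $B\leq P$:

```latex
B=\sum_{i=1}^{N_{b}}Q\left(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime},\boldsymbol{x}_{i}\right),
\qquad
P=\sum_{i=1}^{N_{p}}Q\left(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime},\boldsymbol{x}_{i}^{\prime}\right),
\qquad
\frac{P}{B+P}\geq\frac{1}{2}
\;\Longleftrightarrow\;
2P\geq B+P
\;\Longleftrightarrow\;
B\leq P .
```

Combined with (7), the condition $B\leq P$ is therefore sufficient for (8).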

Finally, using the kernel function $Q(\boldsymbol{x},\boldsymbol{z})=e^{-2\gamma\|\boldsymbol{x}-\boldsymbol{z}\|^{2}}$, we see that:

$$\sum_{i=1}^{N_{p}}\left(1-e^{-2\gamma\|m\odot(\boldsymbol{t}-\boldsymbol{x}_{i})\|^{2}}\right)>0 \tag{10}$$

This inequality shows that the internal term is positive, implying that $f(\boldsymbol{x}^{\prime}+\boldsymbol{\delta}^{\prime})=y_{t}$. Thus, the backdoor attack successfully causes the model to misclassify the poisoned sample as belonging to the target class $y_{t}$, confirming the attack’s effectiveness. Based on the above proof, we can further obtain the theorem below:

Figure 5: Loss landscape of clean sample and poisoned sample.

Theorem 1. Let $\mathcal{D}=\{(\boldsymbol{x}_{i},y_{i})\}_{i=1}^{N_{b}}$ be a clean dataset consisting of $N_{b}$ samples, and $\mathcal{D}_{p}=\{(\boldsymbol{x}_{i}^{\prime},y_{t}^{\prime})\}_{i=1}^{N_{p}}$ be a poisoned dataset containing $N_{p}$ samples, both independently and identically distributed (i.i.d.) according to a uniform distribution across $K$ classes. Suppose a deep neural network $f(\cdot;\theta)$ is formulated as a multivariate kernel regression using a Radial Basis Function (RBF) kernel, and shares the same objective as the attackers. For a given attacked sample $\boldsymbol{x}^{\prime}=(\mathbf{1}-\boldsymbol{m})\times\boldsymbol{x}+\boldsymbol{m}\times\boldsymbol{t}$, where $\mathbf{1}$ is a vector of ones, $\boldsymbol{m}$ is a binary mask, $\boldsymbol{x}$ is the original sample, and $\boldsymbol{t}$ is the trigger pattern, we have: $\lim_{N_{p}\to N_{b}}C(\boldsymbol{x}^{\prime}+\delta)=y_{t}$, where $C(\cdot)$ denotes the classification function of the network, $\delta$ represents a perturbation added to the attacked sample, and $y_{t}$ is the target class label.

The theorem above demonstrates that when the size of $\mathcal{D}_{p}$ approaches that of the clean dataset, poisoned samples exhibit robustness against potential perturbations. To gain an intuitive understanding, we visualize the loss landscapes of both clean and poisoned samples (using the Blend attack [25]), as illustrated in Fig. 5. Notably, the loss of a clean sample changes more steeply when perturbations are applied in the direction of the gradient. This observation further corroborates the previous theoretical analysis.
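The kernel-regression view behind the theorem can be checked numerically on a toy example. The sketch below evaluates the target-class score of Eq. (4) with the RBF kernel for a perturbed triggered input and for the same content without the trigger. The two-dimensional features (content, trigger channel), the sample values, and the function names `rbf` and `phi_t` are all assumptions chosen for illustration; this is not the authors' experimental setup.

```python
import math


def rbf(x, z, gamma=1.0):
    """Kernel Q(x, z) = exp(-2 * gamma * ||x - z||^2)."""
    return math.exp(-2.0 * gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def phi_t(query, clean, poisoned, target, gamma=1.0):
    """Kernel-regression score for the target class, following Eq. (4).

    clean:    list of (x, label) pairs with their true labels
    poisoned: list of triggered inputs, all relabeled to `target`
    """
    clean_all = sum(rbf(query, x, gamma) for x, _ in clean)
    clean_tgt = sum(rbf(query, x, gamma) for x, y in clean if y == target)
    pois = sum(rbf(query, x, gamma) for x in poisoned)
    return (clean_tgt + pois) / (clean_all + pois)


# Toy 2-D features: (content, trigger channel). All values are made up.
clean = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((-0.1, 0.0), 0), ((0.2, 0.0), 0),
         ((3.0, 0.0), 1), ((3.1, 0.0), 1)]
poisoned = [(0.0, 1.0), (0.1, 1.0), (-0.1, 1.0), (0.2, 1.0)]

perturbed_poisoned_query = (0.05, 0.8)  # triggered input plus a small perturbation
clean_query = (0.05, 0.0)               # same content without the trigger

score_poisoned = phi_t(perturbed_poisoned_query, clean, poisoned, target=1)
score_clean = phi_t(clean_query, clean, poisoned, target=1)
print(score_poisoned, score_clean)
```

In this toy setting the perturbed triggered input keeps a target-class score above $\frac{1}{2}$, while the trigger-free input stays well below it, which matches the qualitative claim of the theorem.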

Algorithm 1 Untrusted Dataset Partition
Input: Untrusted Dataset $\mathcal{D}_{u}$, Backdoored Model $\hat{f}$, Perturbation Radius $\epsilon$, Patch Size $r$, Partition Rate $p$
Output: Partitioned Dataset $\mathcal{D}_{u}$ and $\mathcal{D}_{c}$
  $Q=[\ ]$ // Initialize empty list $Q$
  for $(x,y)$ in $\mathcal{D}_{u}$ do
     // Generate a mask of the perturbation patch
     $\hat{m}=RandomPatchMask(r)$
     $\delta=\max(\min(\frac{\nabla_{\delta}\ell(f_{\theta}(x),y)}{\|\nabla_{\delta}\ell(f_{\theta}(x),y)\|_{2}},\epsilon),-\epsilon)$
     $\hat{x}=x\times(1-\hat{m})+\hat{m}\times\delta$
     // Calculate KL divergence of each sample
     $Q.append(\mathcal{L}_{kl}(f(x),f(\hat{x})))$
  end for
  // Partition $\mathcal{D}_{u}$ according to $Q$ and $p$
  $\mathcal{D}_{u},\mathcal{D}_{c}=sort(\mathcal{D}_{u},Q,p)$
  return $\mathcal{D}_{u},\mathcal{D}_{c}$
Figure 6: ASR and ACC of the model during backdoor-enhanced training on CIFAR10 with ResNet18.

4.2.2 Dataset Partition

Based on the findings in the above section, we roughly partition the untrusted dataset by exploiting the fact that poisoned samples exhibit greater robustness against potential perturbations than clean samples. However, as shown in Fig. 4, these poisoned samples are still mixed with some clean ones. Therefore, in this section, we only pick out a small number of clean samples iteratively during the training stage rather than precisely selecting all the poisoned samples. Note that we do not require any auxiliary clean dataset.

Algorithm 1 shows the complete dataset partition procedure. It splits the dataset into a presumed-clean subset and a still-untrusted subset by exploiting how differently the model responds to perturbed inputs. The algorithm iterates over each sample $(x,y)$ in the dataset $\mathcal{D}_{u}$ and applies to it a perturbation patch of size $r$ whose magnitude is bounded by the perturbation radius $\epsilon$. Concretely, it generates a random patch mask $\hat{m}$, computes an adversarial perturbation $\delta$ from the normalized loss gradient, and applies $\delta$ to the input $x$ through the mask to obtain the perturbed counterpart $\hat{x}$. The partitioning metric is the KL divergence between the model’s predictions on the original and perturbed inputs; these divergence scores are collected in a list $Q$. The dataset is then partitioned into two subsets, $\mathcal{D}_{u}$ and $\mathcal{D}_{c}$, according to the divergence scores and the partition rate $p$, holding the potentially poisoned samples and the presumed-clean samples, respectively. In this way, the method uses the different sensitivity of clean and poisoned samples to perturbations under a backdoored model to isolate suspicious data. With the above steps, we obtain $\mathcal{D}_{u}$ and $\mathcal{D}_{c}$. Here, $\mathcal{D}_{c}$ denotes the selected small number of clean samples, and $\mathcal{D}_{u}$ represents the rest of the training data, which contains the poisoned samples.
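The scoring and splitting steps of the partition procedure can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the model outputs before and after perturbation are taken as given (the gradient-based patch perturbation is omitted), the function names are hypothetical, and it assumes that the fraction $p$ of samples with the largest divergence, i.e. the least consistent predictions, is the one treated as clean.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability vectors of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def partition_by_divergence(samples, original_probs, perturbed_probs, p):
    """Split samples by how much their prediction shifts under perturbation.

    Assumption: the fraction p with the largest KL divergence is returned as
    the presumed-clean subset d_c; the remainder stays in the untrusted d_u.
    """
    scores = [kl_divergence(a, b) for a, b in zip(original_probs, perturbed_probs)]
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    n_clean = int(round(p * len(samples)))
    clean_idx = set(order[:n_clean])
    d_c = [s for i, s in enumerate(samples) if i in clean_idx]
    d_u = [s for i, s in enumerate(samples) if i not in clean_idx]
    return d_u, d_c


# Toy predictions for four samples (made-up numbers, two classes).
samples = ["a", "b", "c", "d"]
before = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
after = [[0.9, 0.1], [0.5, 0.5], [0.88, 0.12], [0.2, 0.8]]

d_u, d_c = partition_by_divergence(samples, before, after, p=0.5)
```

In this toy run the two samples whose predictions move the most under perturbation ("b" and "d") end up in the presumed-clean subset, while the stable ones remain untrusted.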

4.2.3 Backdoor Enhancement and Standard Training

Note that the small number of clean samples in $\mathcal{D}_{c}$ is far too few to train a clean model. However, motivated by [18], a small number of clean samples is enough to unlearn the functionality of a model on the main task. That is, we can obtain a model $\hat{f}_{\theta}$ with a high attack success rate (ASR) but low accuracy (ACC) by unlearning some clean samples. To achieve a higher ASR and lower ACC, we train a backdoor-enhanced model iteratively by unlearning a small number of clean samples and learning the remaining poisoned dataset. Concretely, in each iteration, a fraction $p$ of the samples with the lowest consistency is labeled as clean samples $\mathcal{D}_{c}$ and unlearned:

$$\theta=\arg\max_{\theta}\sum_{(x,y)\in\mathcal{D}_{c}}l(\hat{f}_{\theta}(x),y). \tag{11}$$

During this process, more clean samples will be picked out for unlearning, and the model will be trained on a dataset with an increasing poison ratio. As illustrated in Fig. 6, the ASR of the model keeps increasing while its ACC keeps declining. Algorithm 2 shows the iterative backdoor-enhanced training process.

Algorithm 2 Backdoor Enhancement and Standard Training
Input: Untrusted Dataset $\mathcal{D}_{u}$, Backdoor Training Epochs $E_{b}$, Standard Training Epochs $E_{s}$
Output: Clean Model $f_{\theta}$
  $\mathcal{\hat{D}}_{u}=\mathcal{D}_{u}$ // Initialize $\mathcal{\hat{D}}_{u}$
  for $e$ in $range(E_{b})$ do
     $\theta=\arg\min_{\theta}\sum_{(x,y)\in\mathcal{\hat{D}}_{u}}l(\hat{f}_{\theta}(x),y)$
     // Update $\mathcal{\hat{D}}_{u}$ with partition algorithm
     // $\gamma$ is a coefficient
     $p=1-(e+1)\times\gamma$
     $\mathcal{\hat{D}}_{u},\mathcal{D}_{c}=Partition(\mathcal{D}_{u},\hat{f}_{\theta},p)$
     // Unlearn the clean samples
     $\theta=\arg\max_{\theta}\sum_{(x,y)\in\mathcal{D}_{c}}l(\hat{f}_{\theta}(x),y)$
  end for
  // Pick out the clean samples
  $\mathcal{D}_{c}=\{(x,y)\in\mathcal{D}_{u}:\hat{f}_{\theta}(x)\neq y\}$
  for $e$ in $range(E_{s})$ do
     $\theta=\arg\min_{\theta}\sum_{(x,y)\in\mathcal{D}_{c}}l(f_{\theta}(x),y)$
  end for
  return $f_{\theta}$

Our method does not end with training the backdoor-enhanced model; that model serves to identify the poisoned samples. Samples that the backdoor-enhanced model misclassifies are marked as clean, while the rest are labeled as poisoned. Finally, with the selected clean samples, we can train a clean model.
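Two small pieces of Algorithm 2 can be sketched in Python: the partition-rate update $p=1-(e+1)\times\gamma$ and the final selection of samples that the backdoor-enhanced model misclassifies. The function names, the clamping of $p$ to $[0,1]$, and the stand-in model that always predicts the target class are assumptions for illustration; actual training and unlearning are omitted.

```python
def partition_rate(epoch, gamma):
    """Partition rate p = 1 - (e + 1) * gamma from Algorithm 2.

    Clamping to [0, 1] is an assumption added here so the value stays a
    valid fraction for any epoch; the paper does not state this.
    """
    return min(1.0, max(0.0, 1.0 - (epoch + 1) * gamma))


def select_presumed_clean(dataset, predict):
    """Keep the (x, y) pairs that the backdoor-enhanced model misclassifies."""
    return [(x, y) for x, y in dataset if predict(x) != y]


# Hypothetical stand-in for a fully backdoor-enhanced model: it outputs the
# target class (here 1) for every input.
def always_target(x):
    return 1


toy_dataset = [("img0", 0), ("img1", 1), ("img2", 3), ("img3", 1)]
selected = select_presumed_clean(toy_dataset, always_target)
rates = [partition_rate(e, gamma=0.1) for e in range(3)]
```

Under this idealized model every sample labeled with the target class is kept out of the selection, so clean samples that genuinely belong to the target class are also excluded; this sketch makes no claim about how the paper handles them.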

4.2.4 Relabel and Relearn

To obtain a complete clean dataset and further improve the performance of the clean model, MeCa relabels the poisoned samples and merges them with the clean samples:

$$\mathcal{D}_{com}=\mathcal{D}_{clean}\cup\{(x_{i},\hat{y}_{i})\,|\,\hat{y}_{i}=f(x_{i}),(x_{i},y_{i})\in\mathcal{D}_{p}\}. \tag{12}$$

where $\mathcal{D}_{clean}$ denotes the selected clean samples based on the backdoor-enhanced model and $\mathcal{D}_{com}$ represents the complete clean dataset. Then, the complete dataset can be used to fine-tune the clean model to further improve its performance. Note that our experiment mainly shows the results before relabeling and relearning to follow the setting in previous works [18, 19], and the performance of the model fine-tuned on the complete dataset is given separately.
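The relabel-and-merge step of Eq. (12) amounts to replacing the labels of the flagged samples with model predictions and concatenating the result with the clean subset. The sketch below assumes that $f$ is the clean model obtained from standard training, which the text implies but does not state explicitly; the function name and the lookup-table model are hypothetical.

```python
def build_complete_dataset(d_clean, d_poisoned, predict):
    """Return D_com = D_clean ∪ {(x, f(x)) : (x, y) in D_p}.

    `predict` plays the role of f; the original (possibly malicious)
    labels of the poisoned samples are discarded.
    """
    relabeled = [(x, predict(x)) for x, _ in d_poisoned]
    return list(d_clean) + relabeled


# Toy example: a lookup table stands in for the clean model's predictions.
fake_predictions = {"p0": 2, "p1": 0}
d_clean = [("c0", 0), ("c1", 2)]
d_poisoned = [("p0", 1), ("p1", 1)]  # both carry the target label 1

d_com = build_complete_dataset(d_clean, d_poisoned, fake_predictions.get)
```

The quality of the merged dataset depends on how accurate the relabeling model is on triggered inputs, which this sketch does not address.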

5 Experiments

In this section, we experimentally evaluate the performance of the proposed MeCa. Firstly, we show the experimental settings, including the experimental environment, datasets, networks, backdoor attacks, and backdoor defenses. Then, we analyze and summarize the experimental results. Finally, we also present the ablation studies.

5.1 Experimental Settings

5.1.1 Datasets

We run our experiments on four image datasets: Imagenette, Tiny ImageNet, CIFAR10, and CIFAR100. The four datasets are introduced as follows.

  • Imagenette [55]: Imagenette is a small dataset extracted from the large dataset ImageNet (more than 14 million images, 20,000 categories). Imagenette’s training set contains 9,469 images, and its test set contains 3,925 images, all in JPEG format. The image resolution is not uniform, but the width and height are no less than 160 pixels.

  • Tiny ImageNet [56]: Tiny ImageNet contains 100,000 images of 200 classes (500 for each class), downsized to $64\times 64$ color images. Each class has 500 training images, 50 validation images, and 50 test images.

  • CIFAR10 [57]: The CIFAR10 dataset is a database of tiny images, containing 50,000 training images and 10,000 testing images in 10 classes.

  • CIFAR100 [57]: CIFAR100 contains 100 categories. Each category contains 500 training images and 100 test images, all of size $32\times 32$.

5.1.2 Networks

We conduct our experiments on three deep learning models: ResNet18, ResNet34, and MobileNetV2. Unless otherwise stated, the default experimental setup assumes the use of ResNet18.

5.1.3 Experimental Environment

All experiments were conducted on a server equipped with Intel i9-9900K, 3.60GHz processor, 32GB RAM, NVIDIA GeForce RTX 3090, and PyTorch.

5.1.4 Backdoor Attacks

We employ 8 state-of-the-art backdoor attacks to evaluate the proposed MeCa in our experiments, including 3 dirty-label attacks: BadNet [12], Blend [25], and PhysicalBA [46], one clean-label attack: SIG [48], one input-specific-trigger attack: WaNet [47], and 3 adaptive attacks: TaCT [43], AdaptiveBlend [49] and AdaptivePatch [49]. The detailed configurations of these backdoor attacks are shown in Table I.

TABLE I: The detailed settings of the 8888 state-of-the-art backdoor attacks.
Attack Method Poison Ratio Trigger Static/Dynamic Dirty/Clean Adaptive Source Target Cover Rate
BadNet 0.05 3*3 white square Static Dirty No / 1 /
Blend 0.05 Hello Kitty, alpha=0.2 Static Dirty No / 1 /
SIG 0.05 Sinusoidal signal Static Clean No / 1 /
WaNet 0.05 Warping-based triggers Dynamic Dirty No / 1 0.05
PhysicalBA 0.05 Firefox Static Dirty No / 1 /
TaCT 0.05 3*3 white square Static Dirty Yes 0 1 0.01
AdaptiveBlend 0.05 Hello Kitty, alpha=0.2 Static Dirty Yes / 1 0.01
AdaptivePatch 0.05 Firefox*4 Static Dirty Yes / 1 0.01
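
The two simplest static triggers in Table I can be illustrated in a few lines. The following is a minimal sketch, not the attack code used in the experiments. It assumes grayscale images stored as nested lists with pixel values in [0, 1], places the 3*3 white square in the bottom-right corner (Table I does not state the position), and uses the hypothetical helper names `stamp_patch`, `blend_trigger`, and `num_poisoned`.

```python
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]


def stamp_patch(image: Image, size: int = 3, value: float = 1.0) -> Image:
    """Return a copy of `image` with a size x size square of `value`
    stamped in the bottom-right corner (the position is an assumption)."""
    out = [row[:] for row in image]
    height, width = len(out), len(out[0])
    for r in range(height - size, height):
        for c in range(width - size, width):
            out[r][c] = value
    return out


def blend_trigger(image: Image, trigger: Image, alpha: float = 0.2) -> Image:
    """Blend a trigger image into `image`: (1 - alpha) * x + alpha * t."""
    return [
        [(1.0 - alpha) * x + alpha * t for x, t in zip(img_row, trig_row)]
        for img_row, trig_row in zip(image, trigger)
    ]


def num_poisoned(dataset_size: int, poison_ratio: float = 0.05) -> int:
    """Number of training samples to poison for a given poison ratio."""
    return int(round(dataset_size * poison_ratio))


if __name__ == "__main__":
    clean = [[0.0] * 32 for _ in range(32)]
    white = [[1.0] * 32 for _ in range(32)]
    patched = stamp_patch(clean)
    blended = blend_trigger(clean, white)
    print(sum(map(sum, patched)))   # 9.0, i.e. nine white pixels
    print(round(blended[0][0], 2))  # 0.2
    print(num_poisoned(50_000))     # 2500 poisoned CIFAR10 training samples
```

For the dirty-label attacks in Table I, the label of each poisoned sample would additionally be set to the target class 1; the clean-label SIG attack leaves labels unchanged.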

5.1.5 Backdoor Defense

We compare MeCa with 7 established backdoor defense baselines, including FineTuning [58], FinePruning [58], CutMix [59], CLP [24], DBR [19], SCAnFT [43], and ABL [18].

Our detailed configurations for baseline defense (CIFAR10) are listed as follows:

  • FineTuning: We use 2,000 reserved clean samples to finetune the “full layer” of the model, and the learning rate in the repair phase is set to 0.001 with 10 epochs.

  • FinePruning: We use 2,000 reserved clean samples to prune 30% of the channels of the second layer.

  • CutMix: The probability of CutMix is set to 1 with β=1, γ=1. To repair the backdoored model, the learning rate in the repair phase is 0.01 for 10 epochs.

  • CLP: Similarly, CLP needs to prune abnormal channels of the model that are out of 3 standard deviations and repair the model for 10 epochs.

  • DBR: DBR tries to pick out samples with high sensitivity to the “rotate” and “affine” transformations. We set the clean ratio and poison ratio to 0.4 and 0.05, respectively, and unlearn the poisoned samples for 5 epochs with an unlearning rate of 0.0001.

  • SCAnFT: SCAnFT is a poisoned sample detection algorithm. Here we adopt it to pick out the triggered samples and unlearn them with the unlearning rate set to 0.0001.

  • ABL: ABL tries to pick out poisoned samples by identifying the samples with lower loss values. The threshold γ of loss values is set to 0.8, and the unlearning rate is 0.0001 for 5 epochs.
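
For quick reference, the baseline settings listed above can be collected into a single configuration. The dictionary below is only a sketch: the key names and the helper `uses_unlearning` are hypothetical and do not correspond to any particular codebase, and only the values are taken from the list above.

```python
# Baseline defense settings on CIFAR10, transcribed from the list above.
# Key names are hypothetical; only the values come from the text.
BASELINE_CONFIGS = {
    "FineTuning": {"clean_samples": 2000, "layers": "full", "lr": 0.001, "epochs": 10},
    "FinePruning": {"clean_samples": 2000, "prune_ratio": 0.30, "layer": "second"},
    "CutMix": {"probability": 1.0, "beta": 1.0, "gamma": 1.0, "lr": 0.01, "epochs": 10},
    "CLP": {"std_threshold": 3, "repair_epochs": 10},
    "DBR": {
        "transforms": ("rotate", "affine"),
        "clean_ratio": 0.4,
        "poison_ratio": 0.05,
        "unlearn_lr": 0.0001,
        "unlearn_epochs": 5,
    },
    # The text gives no epoch count for SCAnFT, so none is recorded here.
    "SCAnFT": {"unlearn_lr": 0.0001},
    "ABL": {"loss_threshold_gamma": 0.8, "unlearn_lr": 0.0001, "unlearn_epochs": 5},
}


def uses_unlearning(name: str) -> bool:
    """True if the listed settings for `name` include an unlearning rate."""
    return "unlearn_lr" in BASELINE_CONFIGS[name]


if __name__ == "__main__":
    for name, cfg in BASELINE_CONFIGS.items():
        print(f"{name:12s} unlearning={uses_unlearning(name)} {cfg}")
```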

5.1.6 Evaluation Metrics

We employ the attack success rate (ASR) and accuracy (ACC) to evaluate the performance of the proposed backdoor defense framework. ASR indicates the ratio of the triggered samples that are misclassified as the target label. ACC represents the accuracy of the model on the clean samples. The lower ASR and the higher ACC indicate the better performance of the backdoor defense mechanism.

Specifically, we consider a defense successful if the post-defense ASR is under 20%, and unsuccessful otherwise, as done in prior works [20, 60].
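
The two metrics and the success criterion can be written down directly. The sketch below is illustrative rather than the evaluation code used in the experiments. The function names are hypothetical, and it assumes (as is common, though not stated above) that triggered samples whose true label already equals the target label are excluded from the ASR.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """ACC (%): share of clean samples whose prediction matches the label."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return 100.0 * correct / len(labels)


def attack_success_rate(
    preds_on_triggered: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """ASR (%): share of triggered samples predicted as the target label.

    Assumption: samples whose true label is already the target are skipped,
    since predicting the target for them is not a misclassification.
    """
    eligible = [
        p for p, y in zip(preds_on_triggered, true_labels) if y != target_label
    ]
    hits = sum(p == target_label for p in eligible)
    return 100.0 * hits / len(eligible)


def defense_succeeded(asr: float, threshold: float = 20.0) -> bool:
    """A defense counts as successful if the post-defense ASR is under 20%."""
    return asr < threshold


if __name__ == "__main__":
    print(accuracy([0, 1, 2, 2], [0, 1, 2, 3]))                      # 75.0
    print(attack_success_rate([1, 1, 0, 1, 1], [0, 2, 3, 1, 4], 1))  # 75.0
    print(defense_succeeded(19.15), defense_succeeded(20.81))        # True False
```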

TABLE II: The comparison results (%) between our MeCa and 7 state-of-the-art backdoor defense methods against 5 state-of-the-art backdoor attacks on ResNet18 with a 5% poisoning ratio.
No Defense FineTuning FinePruning CutMix CLP DBR SCAnFT ABL MeCa
Dataset Types ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR
BadNet 94.10 93.70 71.47 32.13 61.50 54.85 72.30 56.00 33.24 6.09 69.81 57.62 46.54 16.34 57.61 58.45 73.15 0.00
Blend 96.60 95.50 70.08 88.92 60.94 93.63 55.12 92.80 13.30 0.00 45.15 70.64 57.89 17.17 62.60 79.22 73.13 0.00
SIG 96.40 86.30 70.36 71.47 60.39 15.51 70.63 88.09 16.90 74.24 61.50 0.28 41.83 0.28 16.90 83.38 73.96 0.00
WaNet 94.30 45.90 69.81 11.08 59.56 41.00 68.98 49.86 29.36 48.48 64.82 7.20 52.35 12.47 36.84 78.67 70.64 0.00
PhysicalBA 89.30 100.00 58.73 100.00 54.02 99.72 56.79 59.78 23.27 0.00 63.43 6.37 77.03 10.77 63.37 2.77 60.11 0.00
Imagenette Average 94.14 84.28 68.09 60.72 59.28 60.94 64.76 69.31 23.21 25.76 60.94 28.42 55.13 11.41 47.46 60.49 70.20 0.00
BadNet 92.00 95.20 83.30 96.81 75.05 93.19 81.21 83.30 34.18 0.55 67.47 1.21 69.89 2.09 71.98 0.00 85.72 0.00
Blend 84.50 99.90 84.62 100.00 79.23 98.79 80.44 99.67 51.00 3.52 37.03 98.46 78.57 1.43 82.20 5.71 85.71 0.00
SIG 84.70 95.40 82.31 95.93 80.88 90.11 79.12 99.12 37.47 71.97 38.57 93.52 66.92 3.63 85.93 48.35 87.69 2.64
WaNet 82.80 51.30 83.41 70.33 79.78 61.65 76.70 31.10 12.75 78.02 26.70 6.15 56.70 60.77 83.74 88.57 83.63 0.00
PhysicalBA 91.60 100.00 86.04 100.00 80.11 49.78 85.38 79.23 27.47 0.00 30.55 5.71 79.01 0.33 87.14 0.99 82.97 0.00
CIFAR10 Average 87.12 88.36 83.94 92.61 79.01 78.70 80.57 78.48 32.57 30.81 40.06 41.01 70.22 13.65 82.20 28.72 85.14 0.53
BadNet 72.10 98.20 58.40 91.40 42.00 91.09 42.91 86.03 54.55 83.30 29.35 0.10 56.78 0.00 63.16 0.10 60.53 0.00
Blend 51.90 100.00 60.12 99.80 31.58 99.79 48.99 99.70 34.31 59.51 8.20 89.88 48.89 0.00 61.40 1.21 60.53 0.00
SIG 54.40 95.00 60.32 96.86 40.28 67.71 48.99 95.95 43.72 69.13 7.39 80.10 44.03 2.94 62.96 90.59 57.49 1.11
WaNet 53.60 48.60 59.82 47.98 30.67 19.63 44.64 62.25 48.89 81.98 5.77 15.89 41.70 2.13 63.46 3.85 58.20 0.00
PhysicalBA 71.10 100.00 70.95 100.00 59.51 7.69 60.22 86.03 45.14 96.05 51.32 77.43 59.21 0.00 61.53 0.00 58.20 0.00
CIFAR100 Average 60.62 88.36 61.92 87.21 40.81 57.18 49.15 85.99 45.32 77.99 20.41 52.68 50.12 1.01 62.50 19.15 58.99 0.22
BadNet 37.80 97.95 46.00 98.54 26.87 97.58 31.55 90.84 5.08 84.90 0.70 0.00 32.21 0.00 46.80 0.00 35.73 0.00
Blend 34.60 99.95 49.20 100.00 22.29 96.98 32.11 99.80 10.37 18.32 21.54 99.75 5.33 0.00 38.65 0.00 35.73 0.00
SIG 36.50 98.80 47.26 97.79 25.52 89.43 33.87 99.39 7.60 100.00 0.75 0.00 29.89 0.00 26.27 78.86 38.60 0.00
WaNet 36.90 97.30 46.00 96.43 21.24 98.14 30.10 98.74 9.96 35.63 0.35 32.10 26.07 0.00 37.54 99.50 40.56 0.00
PhysicalBA 48.15 100.00 52.64 100.00 33.27 33.57 42.12 93.51 12.33 20.63 1.06 98.64 29.54 0.00 40.92 100.00 38.70 0.00
TinyImageNet Average 38.79 98.80 48.22 98.55 25.84 83.14 33.95 96.46 9.07 51.90 4.88 46.10 24.61 0.00 38.04 55.67 37.86 0.00

5.2 Comparative Evaluation of MeCa and Existing Defenses

In this paper, we employ 8 state-of-the-art backdoor attacks (including 5 familiar backdoor attacks and 3 adaptive backdoor attacks) to demonstrate the effectiveness of our MeCa. In this section, we consider the 5 familiar backdoor attacks and compare the performance of MeCa with 7 state-of-the-art backdoor defense techniques. Table II shows the performance of the proposed MeCa method on Imagenette, CIFAR10, CIFAR100, and TinyImageNet. Our MeCa achieves the best results in reducing ASR against most backdoor attacks while maintaining a satisfactory ACC across all 4 datasets.

Concretely, we mainly have 5 observations from Table II. The detailed analysis is shown as follows:

(1) Our defense method can resist the 5 state-of-the-art backdoor attacks with excellent performance. The ASR remains almost at 0.00% on the 5 backdoor attacks and 4 datasets. The highest ASR is just 2.64%. Moreover, compared with no defense, our method has a slight drop on the main task on three datasets (CIFAR10: 87.12% vs. 85.14%, CIFAR100: 60.62% vs. 58.99%, and TinyImageNet: 38.79% vs. 37.86%). However, we find that most defense methods fail to maintain the trade-off between ACC and ASR on Imagenette. This may be attributed to the small scale of this dataset: since it is sampled from a larger and more complex dataset, ImageNet, the convergence of models trained on Imagenette may be strongly affected even when only a little data is missing. With 5% of the data poisoned, the original performance on this dataset is also affected.

(2) Apart from SCAnFT, the other 6 state-of-the-art backdoor defense techniques perform poorly. The highest average ASR is 98.55% with FineTuning on TinyImageNet. Only for ABL on CIFAR100 is the average ASR (19.15%) less than 20.00%. The average ASR of all the other defense techniques is more than 20.00%. SCAnFT has a fine defense performance on the four datasets: its average ASR is 11.41%, 13.65%, 1.01%, and 0.00%, respectively. However, SCAnFT has an obvious influence on the main task (from 94.14% to 55.13%, from 87.12% to 70.22%, from 60.62% to 50.12%, and from 38.79% to 24.61%). This indicates that SCAnFT has significant limitations when used in practical scenarios.

(3) The existing defense methods perform unsteadily for the same backdoor attack on different datasets. For example, for DBR under SIG, the ASR on the four datasets is 0.28%, 93.52%, 80.10%, and 0.00%, respectively. For ABL under Blend, the ASR on the four datasets is 79.22%, 5.71%, 1.21%, and 0.00%, respectively. Since the complexity, scale, and class numbers of the different datasets vary, the convergence speed of the attack is also different. This results in a high false positive rate for poisoned sample identification when the backdoor task has not yet been learned. In this case, some clean samples may be picked out, and unlearning these clean samples will significantly influence the model performance.

(4) All the existing defense methods have an unstable performance against different backdoor attacks. They perform satisfactorily on some backdoor attacks and may be ineffective against the other backdoor attacks. This greatly reduces the availability of these defense methods in real-world scenarios because it is impractical for the defender to know the backdoor methods in advance. For example, ABL performs well on the BadNet attack but has an unsatisfactory performance on the SIG attack. Note that we have observed the different performance of these defense methods in different papers because their hyperparameters are different. We argue that it’s not practical to set different hyperparameters for different attacks because defenders have no knowledge about the potential attacks. Therefore, we use the same hyperparameter configurations for other defenses as well as MeCa.

(5) Compared with the 7 state-of-the-art backdoor defense techniques, our MeCa can achieve the lowest ASR (the highest ASR is just 2.64%) while maintaining a satisfactory performance on the clean samples. Moreover, the proposed MeCa has a more stable performance on different datasets and various backdoor attacks.

TABLE III: The comparison results (%) between our MeCa and 7 state-of-the-art backdoor defense methods against 3 state-of-the-art adaptive attacks on ResNet18 with a 5% poisoning ratio.
No Defense FineTuning FinePruning CutMix CLP DBR SCAnFT ABL MeCa
Dataset Types ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR ACC ASR
TaCT 94.90 24.30 72.02 3.60 61.50 4.43 69.00 21.88 57.62 9.14 66.76 9.97 59.28 0.55 50.69 27.15 67.31 0.00
AdaptiveBlend 95.80 70.10 69.81 11.36 63.16 29.36 70.36 2.21 22.99 21.88 42.11 36.01 47.37 12.47 31.02 55.12 69.25 0.83
AdaptivePatch 93.10 94.10 70.91 0.00 52.35 49.31 65.93 43.49 51.52 0.00 61.77 45.98 58.17 0.28 38.78 0.00 73.41 0.00
Imagenette Average 94.60 62.83 70.91 4.99 59.00 27.70 68.43 22.53 44.04 10.34 56.88 30.65 54.94 4.43 40.16 32.92 69.99 0.28
TaCT 82.10 63.20 82.63 65.71 75.05 65.71 78.68 41.43 37.91 3.08 77.58 63.19 72.42 62.20 85.16 48.02 82.42 0.00
AdaptiveBlend 84.00 53.30 82.09 66.48 80.88 54.40 74.95 49.12 49.45 5.49 45.27 12.09 78.90 0.22 88.68 3.41 83.30 0.00
AdaptivePatch 84.80 37.30 84.07 49.78 82.42 0.11 61.65 6.48 70.33 2.09 80.22 25.27 72.64 0.00 83.30 99.34 82.86 0.00
CIFAR10 Average 83.63 51.27 82.93 60.66 79.45 40.07 71.76 32.34 52.56 3.55 67.69 33.52 74.65 20.81 85.71 50.26 82.86 0.00
TaCT 55.40 22.80 58.70 25.00 42.00 17.71 53.24 17.41 18.72 32.29 41.40 29.15 44.13 0.61 63.26 0.91 61.84 0.00
AdaptiveBlend 55.00 52.90 58.70 55.26 40.89 25.20 42.81 22.47 37.55 6.17 2.02 97.87 57.29 52.83 62.15 36.64 63.16 0.00
AdaptivePatch 33.40 90.70 57.29 88.46 26.52 39.78 42.61 22.57 30.16 1.11 8.00 1.52 57.19 0.00 58.00 0.20 59.82 0.00
CIFAR100 Average 47.93 55.47 58.23 56.24 36.47 27.56 46.22 20.82 28.81 13.19 17.14 42.85 52.87 17.81 61.14 12.58 61.61 0.00
TaCT 36.95 13.65 46.55 10.87 26.07 8.05 32.36 19.38 2.62 0.00 0.55 0.00 25.42 0.00 40.71 20.73 38.80 0.00
AdaptiveBlend 34.75 73.40 46.00 74.53 24.81 35.78 32.91 56.06 18.52 20.23 0.20 77.10 23.45 0.00 30.20 52.04 40.82 0.00
AdaptivePatch 38.55 24.35 47.10 99.90 22.40 32.56 32.56 53.85 6.95 8.81 0.45 0.55 33.67 0.00 42.78 99.70 39.71 0.00
TinyImageNet Average 36.75 37.13 46.55 61.77 24.43 25.46 32.61 43.10 9.36 9.68 0.40 25.88 27.51 0.00 37.90 57.49 39.78 0.00

5.3 The Resistance to Adaptive Attacks

To further reveal the potential risk of MeCa, we also consider adaptive adversaries that try to design special backdoor attacks to escape our MeCa method. The adaptive adversaries deliberately establish dependencies between the backdoor and normal functionality, which immensely increases the difficulty of backdoor sample detection. Therefore, we employ three adaptive attacks (TaCT [43], AdaptiveBlend [49] and AdaptivePatch [49]) to demonstrate the effectiveness of our defense method. The three adaptive attacks build the dependencies between the backdoor and normal functions by using different poisoning strategies. Table III shows the defense results of our method and the other 7 state-of-the-art backdoor defense techniques against the 3 adaptive attacks on 4 different datasets. From Table III, we have the following three observations. First, our MeCa has the lowest ASR (nearly 0.00%) while maintaining a satisfactory ACC on the main task. Second, some existing defense methods (CLP and SCAnFT) have a good defense performance against the three adaptive attacks, but they show an obvious decrease in ACC. Specifically, for CLP on CIFAR10, the average ASR is 3.55%, and the average ACC drops from 83.63% to 52.56%. Finally, our method generalizes better across datasets: it always maintains a fine performance, while the performance of the other backdoor defense techniques fluctuates noticeably. In summary, compared with the 7 state-of-the-art backdoor defense techniques, our method has a better performance and greater generalization against adaptive attacks.

5.4 Ablation Studies

Different from the existing backdoor defense methods, we also propose to improve the performance of our MeCa further by relabeling and relearning the backdoor samples. Hence, in this section, we first show the MeCa’s performance after employing relabeling and relearning mechanisms. Then, to demonstrate our MeCa’s generalization, we also experimentally explore the performance of our MeCa on different partition rates, poison ratios, and models. Finally, we also explore the impact of our MeCa on the clean and poisoned dataset.

5.4.1 Performance with Relabeling and Relearning

In our MeCa, after identifying the backdoor samples and clean samples, we relabel the backdoor samples and merge them with the clean samples to obtain a clean and complete dataset, which is then used to fine-tune the clean model. Accordingly, we conduct experiments with ResNet18 on the CIFAR10 dataset to show the effect of the relabeling and relearning mechanism. Table IV shows the ACC and ASR of our MeCa after using relabeling and relearning. Compared with Tables II and III, we can see that the ACC further improves, and the ASR also decreases a little. Specifically, after employing relabeling and relearning, the average ACC and ASR change from 84.29% to 86.65% and from 0.33% to 0.03% against the 8 backdoor attacks, respectively. We infer that the model learns better due to the increase of clean samples. More interestingly, we find that employing relabeling and relearning has a positive effect on clean-label backdoor attacks (e.g., SIG). The ASR drops from 2.64% to 0.22% after equipping the relabeling and relearning mechanism. We deduce that the efficacy of backdoor task learning is closely linked to the initial weights of the model. Therefore, a pretrained model can hardly be backdoored by poisoned samples, which also suggests that model service providers can use pretrained models as base models and fine-tune them on untrusted data to reduce the risk of poisoning attacks.
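
The relabel-and-merge step can be sketched as follows. The paragraph above does not spell out how the new labels are chosen, so this sketch assumes, purely for illustration, that each suspected backdoor sample receives the prediction of the clean model as its new label. The names `relabel_and_merge` and `predict` are hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")  # type of one input sample (e.g. an image)


def relabel_and_merge(
    clean_samples: List[Tuple[X, int]],
    suspected_samples: List[Tuple[X, int]],
    predict: Callable[[X], int],
) -> List[Tuple[X, int]]:
    """Relabel suspected backdoor samples and merge them with the clean ones.

    Assumption: the new label is the clean model's prediction `predict(x)`;
    the original (possibly poisoned) label is discarded.
    """
    relabeled = [(x, predict(x)) for x, _old_label in suspected_samples]
    return list(clean_samples) + relabeled


if __name__ == "__main__":
    # Toy stand-in for a clean model: the "class" is the last digit of x.
    def toy_predict(x: int) -> int:
        return x % 10

    clean = [(3, 3), (14, 4)]
    suspected = [(25, 1), (7, 1)]  # both carry the target label 1
    print(relabel_and_merge(clean, suspected, toy_predict))
    # [(3, 3), (14, 4), (25, 5), (7, 7)]
```

The merged dataset has the same size as the original training set, which is consistent with the explanation above that ACC improves because more clean samples become available for fine-tuning.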

TABLE IV: The performance (%) of ResNet18 with relabeling and complete CIFAR10 dataset.
Strategies→ W/O Relabeling and Relearning W/ Relabeling and Relearning
Attack↓ ACC ASR ACC ASR
BadNet 85.72 0.00 86.81 0.00
Blend 85.71 0.00 87.03 0.00
SIG 87.69 2.64 86.59 0.22
WaNet 83.63 0.00 86.92 0.00
PhysicalBA 82.97 0.00 83.52 0.00
TaCT 82.42 0.00 86.92 0.00
AdaptiveBlend 83.30 0.00 89.12 0.00
AdaptivePatch 82.86 0.00 86.26 0.00
Average 84.29 0.33 86.65 0.03
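
The averages reported in the last row of Table IV follow from the per-attack rows. As a small arithmetic check, the sketch below recomputes them with exact decimal arithmetic. The variable and function names are hypothetical, and rounding half up to two decimals is an assumption about how the averages were rounded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-attack rows of Table IV:
# (ACC w/o, ASR w/o, ACC w/, ASR w/) relabeling and relearning.
TABLE_IV = {
    "BadNet": ("85.72", "0.00", "86.81", "0.00"),
    "Blend": ("85.71", "0.00", "87.03", "0.00"),
    "SIG": ("87.69", "2.64", "86.59", "0.22"),
    "WaNet": ("83.63", "0.00", "86.92", "0.00"),
    "PhysicalBA": ("82.97", "0.00", "83.52", "0.00"),
    "TaCT": ("82.42", "0.00", "86.92", "0.00"),
    "AdaptiveBlend": ("83.30", "0.00", "89.12", "0.00"),
    "AdaptivePatch": ("82.86", "0.00", "86.26", "0.00"),
}


def column_mean(col: int) -> Decimal:
    """Mean of one column, rounded half up to two decimal places."""
    values = [Decimal(row[col]) for row in TABLE_IV.values()]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print([str(column_mean(c)) for c in range(4)])
    # ['84.29', '0.33', '86.65', '0.03']
```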

5.4.2 Effectiveness with Different Partition Rates

In our MeCa method, we have to select a certain percentage of samples based on the $\mathcal{L}_{kl}$ value to train a backdoor enhancement model. This means that the partition rate may be important for MeCa's performance. Therefore, to explore the effect of the partition rate on performance, we conduct experiments on CIFAR10 and 8 backdoor attacks when the partition rate is 0.05, 0.10, 0.15, 0.20, and 0.25, respectively. The experimental results are shown in Fig. 7, which shows that the partition rate has little influence on MeCa's performance. Our method has a stable performance on all the backdoor attacks apart from SIG. The ASR is always nearly 0.00% except for SIG, for which the ASR fluctuates by a very small margin around 0.00%. For ACC, there is also a small fluctuation around 85.00% for all the backdoor attacks. We infer that both are normal fluctuations caused by the randomness of each experiment.
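
The role of the partition rate can be illustrated with a small selection routine. This is a sketch under stated assumptions, not the MeCa implementation: it takes precomputed per-sample $\mathcal{L}_{kl}$ values as given, and it assumes that the samples with the lowest values are the ones selected (the paragraph above does not state the direction, so it is exposed as the hypothetical parameter `pick_lowest`).

```python
from typing import List, Sequence


def select_partition(
    kl_values: Sequence[float],
    partition_rate: float,
    pick_lowest: bool = True,
) -> List[int]:
    """Return the indices of the `partition_rate` fraction of samples
    with the lowest (or highest) per-sample KL value.

    Assumption: selecting the lowest values is the default; flip
    `pick_lowest` if the opposite direction is intended.
    """
    if not 0.0 < partition_rate <= 1.0:
        raise ValueError("partition_rate must be in (0, 1]")
    k = int(round(len(kl_values) * partition_rate))
    order = sorted(
        range(len(kl_values)),
        key=lambda i: kl_values[i],
        reverse=not pick_lowest,
    )
    return order[:k]


if __name__ == "__main__":
    kl = [0.90, 0.05, 0.70, 0.01, 0.80, 0.60, 0.75, 0.85, 0.65, 0.95]
    print(select_partition(kl, 0.2))                     # [3, 1]
    print(select_partition(kl, 0.2, pick_lowest=False))  # [9, 0]
```

With a partition rate of 0.05 on the 50,000 CIFAR10 training images, such a routine would select 2,500 samples; a rate of 0.25 would select 12,500.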

Figure 7: The ASR and ACC (%) of MeCa with different partition rates.
TABLE V: Impact of poisoning ratio on defense performance (%).
Poison Ratio→ 0.1 0.15 0.2
Method↓ ACC ASR ACC ASR ACC ASR
BadNet 85.82 0.00 85.38 0.00 82.20 0.00
Blend 82.00 0.00 83.30 0.00 82.20 0.00
WaNet 83.74 0.00 85.05 0.00 82.42 0.00
PhysicalBA 82.42 0.00 81.76 0.00 81.10 0.00
TaCT 78.35 0.00 78.90 0.00 81.10 0.00
AdaptiveBlend 87.90 0.00 87.69 0.00 78.35 0.00
AdaptivePatch 87.58 0.00 87.25 0.00 86.26 0.00
Average 83.50 0.00 83.87 0.00 81.98 0.00
  • Without attack: ACC 86.04%, ASR 0.66%

5.4.3 Effectiveness with Different Poisoning Ratios

In real-world application scenarios, it is impractical for the defender to know the poisoning ratio of the training data. Therefore, we also demonstrate that our MeCa is suitable for multiple poisoning ratios. Here, we evaluate it on CIFAR10 against 7 backdoor attacks, including BadNet, Blend, WaNet, PhysicalBA, TaCT, AdaptiveBlend, and AdaptivePatch, with poisoning ratios of 0.1, 0.15, and 0.2. Note that SIG is not considered here because it is a clean-label attack that can only poison samples of one class. Since a single class accounts for at most 10.00% of the samples in CIFAR10, the poisoning ratio of SIG on CIFAR10 cannot exceed 10.00%. Table V shows the experimental results. From Table V, we can see that as the poisoning ratio varies, the ASR of our method is always 0.00% against the 7 backdoor attacks. In particular, the ASR is 0.66% when the poisoning ratio is zero. We suppose that the reason is the misclassification of the model on the clean samples. In addition, the experimental results show that our method also maintains a fine ACC against different backdoor attacks even without employing the relabeling and relearning methods.

5.4.4 Effectiveness with Different Models

To demonstrate the generalizability of our method, we conduct experiments on ResNet34 and MobileNetV2 with a poisoning ratio of 0.05. Table VI shows the experimental results on CIFAR10. From Table VI, we can find that our MeCa works similarly well on ResNet34 and MobileNetV2. The ASR is always 0.00% in the two model architectures. Moreover, the ACC also maintains a satisfactory performance (from 88.60% to 88.70% in ResNet34, from 84.33% to 81.89% in MobileNetV2). The experimental results indicate that our MeCa has good generalizability between different models.

TABLE VI: The performance (%) of ResNet34 and MobileNetV2 on CIFAR10.
Model→ ResNet34 MobileNetV2
Attack↓ ACC ASR ACC ASR
No Attack 88.60 0.00 84.33 0.00
BadNet 88.35 0.00 80.30 0.00
Blend 88.35 0.00 81.76 0.00
SIG 89.56 0.00 82.31 0.00
WaNet 88.35 0.00 81.20 0.00
PhysicalBA 87.69 0.00 80.21 0.00
TaCT 89.45 0.00 82.63 0.00
AdaptiveBlend 89.01 0.00 80.22 0.00
AdaptivePatch 88.90 0.00 84.07 0.00
Average 88.70 0.00 81.89 0.00
TABLE VII: Impact of MeCa on the poisoned dataset.
Poison w/o defense Poison w/ defense
Attack Method ACC ASR ACC ASR
BadNet 92.00 95.20 85.72 0.00
Blended 84.50 99.90 85.71 0.00
SIG 84.70 95.40 87.69 2.64
WaNet 82.80 51.30 83.63 0.00
PhysicalBA 91.60 100.00 82.97 0.00
TaCT 82.10 63.20 82.42 0.00
AdaptiveBlend 84.00 53.30 83.30 0.00
AdaptivePatch 84.80 37.30 82.86 0.00
Average 85.81 74.45 84.29 0.33
  • Clean dataset without defense: ACC 87.86%, ASR 0.00%

  • Clean dataset with defense: ACC 86.04%, ASR 0.66%

5.4.5 Impact on clean and poisoned dataset

In real-world applications, it is impractical for the defender to know whether a given training dataset contains poisoned samples. Therefore, to demonstrate the impact of our MeCa on the clean and poisoned dataset, we conduct experiments on CIFAR10 with poisoning ratios of 0 and 0.05. The experimental results (Table VII) illustrate that for training on the poisoned dataset, our MeCa can significantly reduce the average ASR (from 74.45% to 0.33%) with a slight loss of ACC (from 85.81% to 84.29%). On the other hand, when trained on a clean dataset, our MeCa has a minor impact on ACC (from 87.86% to 86.04%). Note that the ASR is 0.66% when training on the clean dataset with our MeCa. We infer that this can be attributed to the misclassification of the model on the clean samples. The experimental results further demonstrate that MeCa is a practical solution for dealing with unknown datasets.

6 Conclusion

In this paper, we first explore the relationship between the backdoor and perturbation through theoretical analysis and experimental verification. Based on the results of this investigation, we propose a novel backdoor defense method (MeCa) to identify poisoned samples and train a clean model on a poisoned dataset. The proposed MeCa partitions the poisoned samples and clean samples according to their robustness to adversarial perturbation. MeCa requires neither an auxiliary clean dataset nor knowledge about the poisoned dataset (e.g., poisoning ratios). Extensive experimental results show the superior performance of MeCa in defending against 8 state-of-the-art backdoor attacks. Compared with the 7 advanced backdoor defense methods, our MeCa has a lower ASR while maintaining a satisfactory ACC on the main task. In addition, the experimental results also demonstrate that the proposed MeCa generalizes well across different poisoning ratios and various model architectures. The relevant experimental results indicate that our method has prominent potential and practicality for real-world application scenarios.

References

  • [1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014, pp. 740–755.
  • [2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision (IJCV), vol. 115, pp. 211–252, 2015.
  • [3] C. Zhou, A. Fu, S. Yu, W. Yang, H. Wang, and Y. Zhang, “Privacy-preserving federated learning in fog computing,” IEEE Internet of Things Journal, vol. 7, no. 11, pp. 10 782–10 793, 2020.
  • [4] Q. Sun and Z. Ge, “A survey on deep learning for data-driven soft sensors,” IEEE Transactions on Industrial Informatics (TII), vol. 17, no. 9, pp. 5853–5866, 2021.
  • [5] A. E. Cinà, K. Grosse, A. Demontis, S. Vascon, W. Zellinger, B. A. Moser, A. Oprea, B. Biggio, M. Pelillo, and F. Roli, “Wild patterns reloaded: A survey of machine learning security against training data poisoning,” ACM Computing Surveys, vol. 55, no. 13s, pp. 1–39, 2023.
  • [6] I. M. Ahmed and M. Y. Kashmoola, “Threats on machine learning technique by data poisoning attack: A survey,” in Advances in Cyber Security: Third International Conference, 2021, pp. 586–600.
  • [7] E. Radiya-Dixit, S. Hong, N. Carlini, and F. Tramèr, “Data poisoning won’t save you from facial recognition,” arXiv preprint arXiv:2106.14851, 2021.
  • [8] J. Chen, H. Zheng, M. Su, T. Du, C. Lin, and S. Ji, “Invisible poisoning: Highly stealthy targeted poisoning attack,” in International Conference on Information Security and Cryptology (Inscrypt), 2020, pp. 173–198.
  • [9] O. J. O. of the European Union, “Artificial intelligence act,” https://www.europarl.europa.eu/doceo/document/TA-9-2024-0138-FNL-COR01_EN.pdf, 2024.
  • [10] T. E. Parliament and the Council of the European Union, “Digital services act,” https://eur-lex.europa.eu/eli/reg/2022/2065/oj, 2022.
  • [11] Y. Li, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor learning: A survey,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2022.
  • [12] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “Badnets: Evaluating backdooring attacks on deep neural networks,” IEEE Access, vol. 7, pp. 47 230–47 244, 2019.
  • [13] H. Huang, Q. Wang, X. Gong, and T. Wang, “Orion: online backdoor sample detection via evolution deviance,” in International Joint Conference on Artificial Intelligence (IJCAI), 2023, pp. 864–874.
  • [14] C. Fu, X. Zhang, S. Ji, T. Wang, P. Lin, Y. Feng, and J. Yin, “FreeEagle: Detecting complex neural trojans in Data-Free cases,” in USENIX Security Symposium (USENIX Security), 2023, pp. 6399–6416.
  • [15] Y. Chen, H. Wu, and J. Zhou, “Progressive poisoned data isolation for training-time backdoor defense,” in Association for the Advancement of Artificial Intelligence (AAAI), 2024, pp. 11 425–11 433.
  • [16] W. Li, P. Chen, S. Liu, and R. Wang, “PSBD: prediction shift uncertainty unlocks backdoor detection,” arXiv preprint arXiv:2406.05826, 2024.
  • [17] C. Zhou, Y. Gao, A. Fu, K. Chen, Z. Dai, Z. Zhang, M. Xue, and Y. Zhang, “PPA: preference profiling attack against federated learning,” in Network And Distributed System Security Symposium (NDSS), 2023.
  • [18] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma, “Anti-backdoor learning: Training clean models on poisoned data,” Advances in Neural Information Processing Systems, vol. 34, pp. 14 900–14 912, 2021.
  • [19] W. Chen, B. Wu, and H. Wang, “Effective backdoor defense by exploiting sensitivity of poisoned samples,” Advances in Neural Information Processing Systems, vol. 35, pp. 9727–9737, 2022.
  • [20] X. Qi, T. Xie, J. T. Wang, T. Wu, S. Mahloujifar, and P. Mittal, “Towards a proactive ml approach for detecting backdoor poison samples,” in USENIX Security Symposium (USENIX Security), 2023, pp. 1685–1702.
  • [21] W. Guo, B. Tondi, and M. Barni, “Universal detection of backdoor attacks via density-based clustering and centroids analysis,” arXiv preprint arXiv:2301.04554, 2023.
  • [22] Z. Chen, S. Wang, A. Fu, Y. Gao, S. Yu, and R. H. Deng, “Linkbreaker: Breaking the backdoor-trigger link in dnns via neurons consistency check,” IEEE Transactions on Information Forensics and Security (TIFS), vol. 17, pp. 2000–2014, 2022.
  • [23] Y. Zeng, M. Pan, H. Jahagirdar, M. Jin, L. Lyu, and R. Jia, “How to sift out a clean data subset in the presence of data poisoning?” arXiv preprint arXiv:2210.06516, 2022.
  • [24] R. Zheng, R. Tang, J. Li, and L. Liu, “Data-free backdoor removal based on channel lipschitzness,” in European Conference on Computer Vision (ECCV), 2022, pp. 175–191.
  • [25] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
  • [26] Y. Liu, X. Ma, J. Bailey, and F. Lu, “Reflection backdoor: A natural backdoor attack on deep neural networks,” in European Conference on Computer Vision (ECCV), 2020, pp. 182–199.
  • [27] Y. Liu, S. Ma, Y. Aafer, W.-C. Lee, J. Zhai, W. Wang, and X. Zhang, “Trojaning attack on neural networks,” in Network And Distributed System Security Symposium (NDSS), 2018.
  • [28] A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein, “Poison frogs! targeted clean-label poisoning attacks on neural networks,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [29] A. Turner, D. Tsipras, and A. Madry, “Clean-label backdoor attacks,” 2018.
  • [30] S. Li, M. Xue, B. Z. H. Zhao, H. Zhu, and X. Zhang, “Invisible backdoor attacks on deep neural networks via steganography and regularization,” IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 18, no. 5, pp. 2088–2105, 2020.
  • [31] C. Zhu, W. R. Huang, H. Li, G. Taylor, C. Studer, and T. Goldstein, “Transferable clean-label poisoning attacks on deep neural nets,” in International Conference on Machine Learning (ICML), 2019, pp. 7614–7623.
  • [32] A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” in Association for the Advancement of Artificial Intelligence (AAAI), vol. 34, no. 07, 2020, pp. 11 957–11 965.
  • [33] X. Gong, Y. Chen, Q. Wang, H. Huang, L. Meng, C. Shen, and Q. Zhang, “Defense-resistant backdoor attacks against deep neural networks in outsourced cloud environment,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 39, no. 8, pp. 2617–2631, 2021.
  • [34] H. Qiu, J. Sun, M. Zhang, X. Pan, and M. Yang, “Belt: Old-school backdoor attacks can evade the state-of-the-art defense with backdoor exclusivity lifting,” arXiv preprint arXiv:2312.04902, 2023.
  • [35] M. Zhu, S. Wei, H. Zha, and B. Wu, “Neural polarizer: A lightweight and effective backdoor defense via purifying poisoned features,” arXiv preprint arXiv:2306.16697, 2023.
  • [36] W. Ma, D. Wang, R. Sun, M. Xue, S. Wen, and Y. Xiang, “The ‘Beatrix’ resurrections: Robust backdoor detection via gram matrices,” arXiv preprint arXiv:2209.11715, 2022.
  • [37] S. Wei, M. Zhang, H. Zha, and B. Wu, “Shared adversarial unlearning: Backdoor mitigation by unlearning shared adversarial examples,” arXiv preprint arXiv:2307.10562, 2023.
  • [38] C.-H. Weng, Y.-T. Lee, and S.-H. B. Wu, “On the trade-off between adversarial and backdoor robustness,” Advances in Neural Information Processing Systems, vol. 33, pp. 11973–11983, 2020.
  • [39] Y. Gao, D. Wu, J. Zhang, S.-T. Xia, G. Niu, and M. Sugiyama, “Does adversarial robustness really imply backdoor vulnerability?” 2021.
  • [40] Y. Li, H. Ma, Z. Zhang, Y. Gao, A. Abuadbba, M. Xue, A. Fu, Y. Zheng, S. F. Al-Sarawi, and D. Abbott, “Ntd: Non-transferability enabled deep learning backdoor detection,” IEEE Transactions on Information Forensics and Security (TIFS), 2023.
  • [41] K. Huang, Y. Li, B. Wu, Z. Qin, and K. Ren, “Backdoor defense via decoupling the training process,” arXiv preprint arXiv:2202.03423, 2022.
  • [42] B. Mu, Z. Niu, L. Wang, X. Wang, Q. Miao, R. Jin, and G. Hua, “Progressive backdoor erasing via connecting backdoor and adversarial attacks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 20495–20503.
  • [43] D. Tang, X. Wang, H. Tang, and K. Zhang, “Demon in the variant: Statistical analysis of dnns for robust backdoor contamination detection,” in USENIX Security Symposium (USENIX Security), 2021, pp. 1541–1558.
  • [44] S. Feng, G. Tao, S. Cheng, G. Shen, X. Xu, Y. Liu, K. Zhang, S. Ma, and X. Zhang, “Detecting backdoors in pre-trained encoders,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 16352–16362.
  • [45] Z. Chen, S. Yu, M. Fan, X. Liu, and R. H. Deng, “Privacy-enhancing and robust backdoor defense for federated learning on heterogeneous data,” IEEE Transactions on Information Forensics and Security (TIFS), 2023.
  • [46] Y. Li, T. Zhai, Y. Jiang, Z. Li, and S.-T. Xia, “Backdoor attack in the physical world,” arXiv preprint arXiv:2104.02361, 2021.
  • [47] A. Nguyen and A. Tran, “Wanet–imperceptible warping-based backdoor attack,” arXiv preprint arXiv:2102.10369, 2021.
  • [48] M. Barni, K. Kallas, and B. Tondi, “A new backdoor attack in cnns by training set corruption without label poisoning,” in IEEE International Conference on Image Processing (ICIP), 2019, pp. 101–105.
  • [49] X. Qi, T. Xie, Y. Li, S. Mahloujifar, and P. Mittal, “Revisiting the assumption of latent separability for backdoor defenses,” in International Conference on Learning Representations (ICLR), 2022.
  • [50] J. Guo, Y. Li, X. Chen, H. Guo, L. Sun, and C. Liu, “Scale-up: An efficient black-box input-level backdoor detection via analyzing scaled prediction consistency,” in International Conference on Learning Representations (ICLR), 2023.
  • [51] A. Khaddaj, G. Leclerc, A. Makelov, K. Georgiev, H. Salman, A. Ilyas, and A. Madry, “Rethinking backdoor attacks,” in International Conference on Machine Learning (ICML), 2023, pp. 16216–16236.
  • [52] Y. Gao, D. Wu, J. Zhang, G. Gan, S.-T. Xia, G. Niu, and M. Sugiyama, “On the effectiveness of adversarial training against backdoor attacks,” IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2023.
  • [53] J. Guo, A. Li, and C. Liu, “Aeva: Black-box backdoor detection using adversarial extreme value analysis,” in International Conference on Learning Representations (ICLR), 2021.
  • [54] J. Guo, Y. Li, X. Chen, H. Guo, L. Sun, and C. Liu, “Scale-up: An efficient black-box input-level backdoor detection via analyzing scaled prediction consistency,” arXiv preprint arXiv:2302.03251, 2023.
  • [55] J. Howard and S. Gugger, “Fastai: a layered api for deep learning,” Information, vol. 11, no. 2, p. 108, 2020.
  • [56] Y. Le and X. Yang, “Tiny imagenet visual recognition challenge,” CS 231N, vol. 7, no. 7, p. 3, 2015.
  • [57] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [58] K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” in International Symposium on Research in Attacks, Intrusions, and Defenses (RAID), 2018, pp. 273–294.
  • [59] E. Borgnia, V. Cherepanova, L. Fowl, A. Ghiasi, J. Geiping, M. Goldblum, T. Goldstein, and A. Gupta, “Strong data augmentation sanitizes poisoning and backdoor attacks without an accuracy tradeoff,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3855–3859.
  • [60] T. Xie, X. Qi, P. He, Y. Li, J. T. Wang, and P. Mittal, “Badexpert: Extracting backdoor functionality for accurate backdoor input detection,” arXiv preprint arXiv:2308.12439, 2023.