
Prompt-Specific Poisoning Attacks On Text-to-Image Generative Models


Shawn Shan, Wenxin Ding, Josephine Passananti, Haitao Zheng, Ben Y. Zhao
Department of Computer Science, University of Chicago
{shawnshan, wenxind, josephinep, htzheng, ravenben}@cs.uchicago.edu

arXiv:2310.13828v1 [cs.CR] 20 Oct 2023

Abstract

Data poisoning attacks manipulate training data to introduce unexpected behaviors into machine learning models at training time. For text-to-image generative models with massive training datasets, current understanding of poisoning attacks suggests that a successful attack would require injecting millions of poison samples into their training pipeline. In this paper, we show that poisoning attacks can be successful on generative models. We observe that training data per concept can be quite limited in these models, making them vulnerable to prompt-specific poisoning attacks, which target a model’s ability to respond to individual prompts.

We introduce Nightshade, an optimized prompt-specific poisoning attack where poison samples look visually identical to benign images with matching text prompts. Nightshade poison samples are also optimized for potency and can corrupt a Stable Diffusion SDXL prompt in <100 poison samples. Nightshade poison effects “bleed through” to related concepts, and multiple attacks can be composed together in a single prompt. Surprisingly, we show that a moderate number of Nightshade attacks can destabilize general features in a text-to-image generative model, effectively disabling its ability to generate meaningful images. Finally, we propose the use of Nightshade and similar tools as a last defense for content creators against web scrapers that ignore opt-out/do-not-crawl directives, and discuss possible implications for model trainers and content creators.

1 Introduction

Over the last year, diffusion-based text-to-image models have taken the Internet by storm, growing from research projects to applications in advertising, fashion [3, 55], web development [2, 42, 58], and AI art [6, 9, 43, 90]. Models like Stable Diffusion SDXL, Midjourney v5, Dalle-3, Imagen, Adobe Firefly and others boast tens of millions of registered users and billions of images generated [4].

Despite their significant impact on business and creative industries, both positive and negative, few have considered the vulnerability of diffusion model architectures to poisoning attacks against image generation. Poisoning attacks manipulate training data to introduce unexpected behavior to the model at training time, and are well-studied in the context of traditional deep learning models such as deep neural network (DNN) classifiers. Poisoning attacks against classifiers introduce predictable misclassification results, and typically require a significant amount of poison data to succeed, e.g., a ratio of poison training samples to benign samples of 20% or higher. Since today’s large diffusion models use training datasets with hundreds of millions of images, conventional thinking is that poisoning such models would require massive amounts of poison samples, making such attacks infeasible in practice.

In this work, we investigate the impact of poisoning attacks on state-of-the-art text-to-image diffusion models. Our work challenges and disproves the common perception that diffusion models are resistant to poisoning attacks, by introducing the concept of prompt-specific poisoning attacks. Specifically, we show that successful poisoning attacks do not need access to the image generation pipeline, nor do they need poison samples comparable in size to the model training dataset. They need only to be comparable to benign training data related to a specific targeted prompt. Generative diffusion models support tens of thousands of prompts. The large majority of these have few training samples associated with them (i.e., low training data “density”), making them easy to poison with relatively few poison samples.

Prompt-specific poisoning attacks are versatile and powerful. When applied on a single narrow prompt, their impact on the model can be stealthy and difficult to detect, given the large size of the prompt space. Examples include advertising (produce Tesla images for “luxury car” prompts) and political attacks (produce offensive images when prompted with a politician’s name). Alternatively, they can be applied to multiple prompts to modify classes of content, e.g., protect Disney’s intellectual property by replacing all Disney characters with generic replacements, or undermine the trustworthiness of an entire model by disrupting random unrelated prompts.

Our work produces a number of notable findings. First and foremost, we examine training density of single-word prompts (or concepts) in existing large-scale datasets. We find that as hypothesized, concepts in popular training datasets like LAION-Aesthetic exhibit very low training data density, both in terms of word sparsity (# of training samples associated explicitly with a specific concept) and semantic sparsity (# of samples associated with a concept and semantically related
terms). Not surprisingly, our second finding is that simple “dirty-label” poison attacks work well to corrupt image generation for specific concepts (e.g., “dog”) using just 500-1000 poison samples. In particular, experiments show high success for poisoning on Stable Diffusion’s newest model (SDXL), using both CLIP-based classification and a crowdsourced user study (IRB-approved) as success metrics.

Next, we propose a significantly optimized prompt-specific poisoning attack we call Nightshade. Nightshade uses multiple optimization techniques (including targeted adversarial perturbations) to generate stealthy and highly effective poison samples, with four observable benefits.

• Nightshade poison samples are benign images shifted in the feature space. Thus a Nightshade sample for the prompt “castle” still looks like a castle to the human eye, but teaches the model to produce images of an old truck.
• Nightshade samples produce stronger poisoning effects, enabling highly successful poisoning attacks with very few (e.g., 100) samples.
• Nightshade samples produce poisoning effects that effectively “bleed-through” to related concepts, and thus cannot be circumvented by prompt replacement, e.g., Nightshade samples poisoning “fantasy art” also affect “dragon” and “Michael Whelan” (a well-known fantasy and SciFi artist).
• We demonstrate that when multiple concepts are poisoned by Nightshade, the attacks remain successful when these concepts appear in a single prompt, and actually stack with cumulative effect. Furthermore, when many Nightshade attacks target different prompts on a single model (e.g., 250 attacks on SDXL), general features in the model become corrupted, and the model’s image generation function collapses.

We note that Nightshade also demonstrates strong transferability across models, and resists a range of defenses designed to deter current poisoning attacks.

Finally, we assert that Nightshade can provide a powerful tool for content owners to protect their intellectual property against model trainers that disregard or ignore copyright notices, do-not-scrape/crawl directives, and opt-out lists. Movie studios, book publishers, game producers and individual artists can use systems like Nightshade to provide a strong disincentive against unauthorized data training. We discuss potential benefits and implications of this usage model.

In short, our work provides four key contributions:
• We propose prompt-specific poisoning attacks, and demonstrate they are realistic and effective on state-of-the-art diffusion models because of “sparsity” of training data.
• We propose Nightshade attacks, optimized prompt-specific poisoning attacks that use guided perturbations to increase poison potency while avoiding visual detection.
• We measure and quantify key properties of Nightshade attacks, including “bleed-through” to semantically similar prompts, multi-attack cumulative destabilizing effects, model transferability, and general resistance to traditional poison defenses.
• We propose Nightshade as a tool to protect copyright and disincentivize unauthorized model training on protected content.

2 Background and Related Work

We begin by providing background on text-to-image models and data poisoning attacks.

2.1 Text-to-Image Generation

Model Architecture. Text-to-image generative models evolved from generative adversarial networks (GAN) and variational autoencoders (VAE) [23, 52, 98] to diffusion models [53, 56]. We defer detailed background on diffusion models to [73]. Recent work [56] further improved the generation quality and training cost of diffusion models by leveraging “latent diffusion,” which converts images from pixel space into a latent feature space using variational autoencoders. Models then perform the diffusion process in the lower-dimensional image feature space, drastically reducing the training cost and enabling models to be trained on much larger datasets. Today, latent diffusion is used in almost all state-of-the-art models [47, 49, 54, 75, 77].

Training Data Sources. Designed to generate images covering the entire spectrum of natural language text (objects, art styles, compositions), today’s generative models train on large and diverse datasets containing all types of image/ALT-text pairs. Models like Stable Diffusion and DALLE-2 [54, 76] are trained on datasets ranging in size from 500 million to 5 billion images scraped from the web [14, 64]. These datasets are subject to minimal moderation, making them vulnerable to malicious actors [13]. Data collectors typically only curate data to exclude samples with insufficient or misaligned captions as determined by an automated alignment model [64].

Continuous Model Training. Training these models from scratch can be expensive (e.g., 150K GPU hours or 600K USD for the first version of Stable Diffusion [78]). As a result, it is common practice for model trainers to continuously update existing models on newly collected data to improve performance [21, 47, 61, 74]. Stable Diffusion 1.4, 1.5, and 2.1 are all continuously trained from previous versions. Stable Diffusion XL 1.0 is continuously trained on version 0.9. Many companies also continuously train public models on new training data tailored to their specific use case, including NovelAI [47], Scenario.gg [61], and Lensa AI [79]. Today, online platforms also offer continuous-training-as-a-service [26, 47, 57].

In our work, we consider poisoning attacks on both training scenarios: 1) training a new model from scratch, and 2) continuously training an existing model on additional data.

2.2 Data Poisoning Attacks

Poisoning Attacks against Classifiers. These attacks inject poison data into training pipelines to degrade performance of the trained model. Poisoning attacks against classifiers are well studied [28]. Aside from basic misclassification attacks, backdoor attacks [40, 86] inject a hidden trigger, e.g., a specific pixel or text pattern [18, 24], into the model, such that inputs containing the trigger are misclassified at inference time. Others proposed clean-label backdoor attacks where attackers do not control the labels on their poison data samples [59, 80, 97].

Defenses against data poisoning are also well-studied. Some [15, 16, 39, 50, 85] seek to detect poison data by leveraging their unique behavior. Other methods propose robust training methods [27, 34, 84] to limit poison data’s impact at training time. Today, poison defenses remain challenging as stronger adaptive attacks are often able to bypass existing defenses [7, 65, 67, 86, 92].

Poisoning Attacks against Diffusion Models. Poisoning attacks against diffusion models remain limited. Some propose backdoor poisoning attacks that inject attacker-defined triggers into text prompts to generate specific images [17, 20, 93], but assume that attackers can directly modify the denoising diffusion steps [17, 20] or directly alter the model’s overall training loss [93].

Our work differs in both attack goal and threat model. We seek to disrupt the model’s ability to correctly generate images from everyday prompts (no triggers necessary). Unlike existing backdoor attacks, we only assume attackers can add poison data to the training dataset, and assume no access to model training and generation pipelines.

Recent work on Glaze [69] adds small perturbations to images to protect artists from unauthorized style mimicry using text-to-image models. Another work [94] studies how specific concepts, e.g., not safe for work (NSFW), can be unlearned from a diffusion model by modifying weights in the model’s cross attention layers. Beyond diffusion models, a few recent works study poisoning attacks against other types of generative models, including large language models [83], contrastive learning [95], and multimodal encoders [38, 91].

3 Feasibility of Poisoning Diffusion Models

Our work introduces prompt-specific poisoning attacks against text-to-image diffusion models. These attacks do not assume any access to the training pipeline or model, but use typical data poisoning methods to corrupt the model’s ability to respond to specific prompts (see Figure 1). For example, a model can be poisoned so that it substitutes images of cats whenever prompted with “dog,” e.g., “a large dog driving a car.” Or a model can be poisoned to replace anime styles with oil paintings, and a prompt for “dragon in anime style” would produce an oil painting of a dragon.

Figure 1. Overview of prompt-specific poison attack. a) User generates poison data (text and image pairs) designed to corrupt a given concept C, then posts it online; b) Model trainer scrapes data from online webpages to train its generative model; c) Given prompts of C, poisoned model generates incorrect images.

We note that these attacks can target one or more specific “keywords” in any prompt sequence (e.g., “dog” or “anime”) that condition image generation. For clarity, we hereby refer to these keywords as concepts.

Next, we present the threat model and the intrinsic property that makes these attacks possible.

3.1 Threat Model

Attacker. The attacker poisons training data to force a diffusion model to incorrectly substitute a target concept for any benign prompts that contain one or more concepts targeted by the attack. More specifically, we assume the attacker:

• can inject a small number of poison samples (image/text pairs) into the model’s training dataset
• can arbitrarily modify the image and text content for all poison data (later we relax this assumption in §6 to build advanced attacks)
• has no access to any other part of the model pipeline (e.g., training, deployment)
• has access to an open-source text-to-image model (e.g., Stable Diffusion).

Note that unlike all prior work on poisoning text-to-image diffusion models, we do not assume an attacker has privileged access to the model training process (§2). Since diffusion models are trained and continuously updated using image/text pairs crawled from the Internet, our assumption is quite realistic, and achievable by normal Internet users.

Model Training. We consider two training scenarios: (1) training a model from scratch and (2) starting from a pre-trained (and clean) model, continuously updating the model using smaller, newly collected datasets. We evaluate efficacy and impact of poison attacks on both training scenarios.

3.2 Concept Sparsity Induces Vulnerability

Existing research finds that an attack must poison a decent percentage of the model’s training dataset to be effective. For
neural network classifiers, the poisoning ratio should exceed 5% for backdoor attacks [29, 40] and 20% for indiscriminate attacks [10, 41]. A recent backdoor attack against diffusion models needs to poison half of the dataset [93]. Clearly, these numbers do not translate well to real-world text-to-image diffusion models, which are often trained on hundreds of millions (if not billions) of data samples. Poisoning 1% of the data would require millions to tens of millions of image samples, far from what is realistic for an attacker without special access to resources.

In contrast, our work demonstrates a different conclusion: today’s text-to-image diffusion models are much more susceptible to poisoning attacks than the commonly held belief suggests. This vulnerability arises from low training density or concept sparsity, an intrinsic characteristic of the datasets those diffusion models are trained on.

Concept Sparsity. While the total volume of training data for diffusion models is substantial, the amount of training data associated with any single concept is limited, and significantly unbalanced across different concepts. For the vast majority of concepts, including common objects and styles that appear frequently in real-world prompts, each is associated with a very small fraction of the total training set, e.g., 0.1% for “dog” and 0.04% for “fantasy.” Furthermore, such sparsity remains at the semantic level, after we aggregate training samples associated with a concept and all its semantically related “neighbors” (e.g., “puppy” and “wolf” are both semantically related to “dog”).

Vulnerability Induced by Training Sparsity. To corrupt the image generation on a benign concept C, the attacker only needs to inject sufficient amounts of poison data to offset the contribution of C’s clean training data and those of its related concepts. Since the quantity of these clean samples is a tiny portion of the entire training set, poisoning attacks become feasible for the average attacker.

3.3 Concept Sparsity in Today’s Datasets

Next, we empirically quantify the level of concept sparsity in today’s diffusion datasets. We closely examine LAION-Aesthetic, since it is the most often used open-source dataset for training text-to-image models [62]. LAION-Aesthetic is a subset of LAION-5B, and contains 600 million text/image pairs and 22833 unique, valid English words across all text prompts.¹ We use nouns as concepts.

¹ We filtered out invalid words based on Open Multilingual WordNet [11].

Word Frequency. We measure concept sparsity by the fraction of data samples associated with each concept C, roughly equivalent to the frequency of C’s appearance in the text portion of the data samples, i.e., word frequency. Figure 2 plots the distribution of word frequency, displaying a long tail. For over 92% of the concepts, each is associated with less than 0.04% of the images, or 240K images. For a more practical context, Table 1 lists the word frequency for ten concepts sampled from the most commonly used words to generate images on Midjourney [1]. The mean frequency is 0.07%, and 6 of 10 concepts show 0.04% or less.

Figure 2. Demonstrating concept sparsity in terms of word and semantic frequencies in LAION-Aesthetic. Both show a long-tail distribution. Note the log scale on both Y axes.

Concept      Word Freq.   Semantic Freq.
night        0.22%        1.69%
portrait     0.17%        3.28%
face         0.13%        0.85%
dragon       0.049%       0.104%
fantasy      0.040%       0.047%
sculpture    0.032%       0.98%
anime        0.027%       0.036%
neon         0.024%       0.93%
palette      0.018%       0.38%
alien        0.0087%      0.012%

Table 1. Word and semantic frequencies in LAION-Aesthetic, for 10 concepts sampled from the list of most queried words on Midjourney [1].

Semantic Frequency. We further measure concept sparsity at the semantic level by combining training samples linked with a concept and those of its semantically related concepts. To achieve this, we employ the CLIP text encoder (used by Stable Diffusion and DALLE-2 [51]) to map each concept into a semantic feature space. Two concepts whose L2 feature distance is under 4.8 are considered semantically related. The threshold value of 4.8 is based on empirical measurements of L2 feature distances between synonyms [25]. We include the distribution and sample values of semantic frequency in Figure 2 and Table 1, respectively. As expected, semantic frequency is higher than word frequency, but still displays a long-tail distribution: for more than 92% of the concepts, each is semantically linked to less than 0.2% of samples. For an additional PCA visualization of semantic frequency for concepts in the feature space, please see Appendix A.2.

4 A Simple “Dirty-Label” Poisoning Attack

The next step in validating the potential for poisoning attacks is to empirically evaluate the effectiveness of simple, “dirty-label” poisoning attacks, where the attacker introduces mismatched text/image pairs into the training data, preventing the model from establishing accurate association between specific concepts and their corresponding images.

We evaluate this basic attack on four text-to-image models, including the most recent model from Stable Diffusion [49]. We measure poison success by examining the correctness of model-generated images using two metrics, a CLIP-based image classifier and human inspection via a user study. We find
that the attack is highly effective when 1000 poison samples are injected into the model’s training data.

Attack Design. The key to the attack is the curation of the mismatched text/image pairs. To attack a regular concept C (e.g., “dog”), the attacker:

• selects a “destination” concept A unrelated to C as guide;
• builds a collection of text prompts Text_C containing the word C while ensuring none of them include A;
• builds a collection of images Image_A, where each visually captures the essence of A but contains no visual elements of C;
• pairs a text prompt from Text_C with an image from Image_A.

Figure 3 shows an example of poison data created to attack the concept “dog” where the concept “cat” was chosen as the destination concept. Once enough poison samples enter the training set, they can overpower the influence of clean training data of C, causing the model to make an incorrect association between C and Image_A. At run-time, the poisoned model outputs an image of the destination concept A (e.g., “cat”) when prompted by the poisoned concept C (e.g., “dog”).

Figure 3. Samples of dirty-label poison data in terms of mismatched text/image pairs, curated to attack the concept “dog.” Here “cat” was chosen by the attacker as the destination concept A.

Experiment Setup. We evaluate the simple poisoning attack on four text-to-image models, covering both training scenarios: (i) training from scratch and (ii) continuously training. For (i), we train a latent diffusion model [56] from scratch² using 1M text/image pairs from the Conceptual Caption dataset [71], referred to as LD-CC. For (ii) we consider three popular pretrained models: Stable Diffusion V2 [76], Stable Diffusion XL [49], and DeepFloyd [77], and randomly sample 100K text/image pairs from LAION to update each model.

² We note that training-from-scratch is prohibitively expensive and has not been attempted by any prior poisoning attacks against diffusion models. Training each LD-CC model takes 8 days on an NVIDIA A100 GPU.

Following literature analyzing popular prompts [30], we select 121 total concepts to attack, including both objects (91 common objects from the COCO dataset) and art styles (20 from Wikiart [60] + 10 digital art styles from [33]). We measure attack effectiveness by assessing whether the model, when prompted by concept C, will generate images that convey C. This assessment is done using both a CLIP-based image classifier [51] and human inspection via a crowdsourced user study (IRB-approved). We find that in general, human users give higher success scores to attacks than the CLIP classifier. Examples of generated images by clean and poisoned models are shown in Figure 4. Additional details of our experiments are described later in §6.1.

Attacking LD-CC. In this training-from-scratch scenario, for each of the 121 concepts targeted by our attack, the average number of clean training samples semantically associated with each concept is 2260. Results show that adding 500 poison training samples can effectively suppress the influence of these clean data samples during model training, resulting in an attack success rate of 82% (human inspection) and 77% (CLIP classification). Adding 500 more poison samples further boosts the attack success rate to 98% (human) and 92% (CLIP). Details are in Figure 19 in the Appendix.

Attacking SD-V2, SD-XL, DeepFloyd. Mounting successful attacks on these models is more challenging than LD-CC, since pre-trained models have already learned each of the 121 concepts from a much larger pool of clean samples (averaging 986K samples per concept). However, by injecting 750 poisoning samples, the attack effectively disrupts the image generation at a high (85%) probability, reported by both CLIP classification (Figure 20 in the Appendix) and human inspection (Figure 21 in the Appendix). Injecting 1000 poisoning samples pushes the success rate beyond 90%.

Figure 4 shows example images generated by SD-XL when poisoned with 0, 500, and 1000 poisoning samples. Here we present four attacks aimed at concepts C (“dog”, “car”, “fantasy art”, “cubism”), using the destination concept A (“cat”, “cow”, “line art”, “cartoon”), respectively. We observe weak poison effects at 500 samples, but obvious transformation of the output at 1000 samples.

Figure 4. Example images generated by the clean (unpoisoned) and poisoned SD-XL models with different # of poison data. The attack effect is apparent with 1000 poisoning samples, but not at 500 samples.

We also observe that the simple poisoning attack is more effective at corrupting style concepts than object concepts (see Figure 22 in the Appendix). This is likely because styles are typically conveyed visually by the entire image, while objects define specific regions within the image. Later in §5 we leverage this observation to build a more advanced attack.

Concept Sparsity Impact on Attack Efficacy. We further study how concept sparsity impacts attack efficacy. We sample 15 object concepts with varying sparsity levels, in terms
of word and semantic frequency discussed in §3.3. As expected, the poisoning attack is more successful when disrupting sparser concepts, and semantic frequency is a more accurate representation of concept sparsity than word frequency. These empirical results confirm our hypothesis in §3.2. We include the detailed plots in the Appendix (Figure 23 and Figure 24).

5 Nightshade: an Optimized Prompt-Specific Poisoning Attack

Our results in §4 show that concept sparsity makes it feasible to poison text-to-image diffusion models. Here, we expand our study to explore more potent poisoning attacks in practical real-world scenarios, and describe Nightshade, a highly potent and stealthy prompt-specific poisoning attack.

5.1 Overview

Our advanced attack has two key goals.

• Poison success with fewer poison samples: Without knowledge of which websites models scrape for training data, or when, it is quite likely most poison samples released into the wild will not be scraped. Thus it is critical to increase potency, so the attack can succeed even when only a small portion of poison samples enter the training pipeline.
• Avoid human and automated detection: Successful attacks must avoid simple data curation or filtering by both humans (visual inspection) and automated methods. Clearly, the basic dirty-label attack (§4) fails in this respect.

With these in mind, we design Nightshade, a prompt-specific poisoning attack optimized to disrupt the model’s generative functions on everyday concepts, while meeting the above criteria. Nightshade reduces the number of necessary poison samples to well below what is achieved by the basic attack and effectively bypasses poison detection. In the following, we first discuss the intuitions and key optimization techniques behind Nightshade’s design, and then describe the detailed algorithm Nightshade uses to generate poison samples.

5.2 Intuitions and Optimization Techniques

Design Intuitions. We design Nightshade based on two intuitions to meet the two aforementioned criteria:

• To reduce the number of poison image/text pairs necessary for a successful attack, one should magnify the influence of each poison text/image pair on the model’s training, and minimize conflicts among different poison text/image pairs.
• To bypass poison detection, the text and image content of each poison sample should appear natural and aligned with each other, to both automated alignment detectors and human inspectors, while achieving the intended poison effect.

Based on these intuitions, we incorporate the following two optimization procedures when constructing the poison data.

Maximizing Poison Influence. To change the model behavior on a concept C, the poison data needs to overcome the contribution made by C’s clean training data. One can model such contribution by the gradients (both norm and direction) used to update the model parameters related to C. To dominate the clean data, the optimal poison data (as a group) should produce gradient values related to C with a high norm, all pointing consistently to a distinct direction away from those of the clean data.

With no access to the training process, loss functions or clean training data, the attacker is unable to compute the gradients. Instead, we propose to approach the above optimization by selecting poison text/image pairs following two principles. First, each poison text prompt clearly and succinctly conveys the keyword C, allowing the poison data to exclusively target the model parameters associated with C. Second, each poison image clearly and succinctly portrays a concept A that is unrelated to C. The irrelevancy between C and A ensures that, when paired with the poison text prompts conveying C, the poison images will produce gradient updates pointing to a distinct direction (defined by A) away from those of the clean data (defined by C).

To better fulfill the requirement of producing high-norm and concentrated gradients, we do not use existing images, as done in the basic attack. Instead, we generate prototypical images of A by querying a text-to-image generative model that the attacker has access to (see threat model in §3.1). The queries directly convey A, i.e., “a photo of {A}” when A is an object, and “a painting in style of {A}” when A is a style.

Constructing “Clean-label” Poison Data. So far, we have created poison data by pairing prototypical, generated images of A with optimized text prompts of C. Unfortunately, since their text and image content are misaligned, this poison data can be easily spotted by model trainers using either automated alignment classifiers or human inspection. To overcome this, Nightshade takes an additional step to replace the generated images of A with perturbed, natural images of C that bypass poison detection while providing the same poison effect.

This step is inspired by clean-label poisoning for classifiers [5, 68, 80, 97]. It applies optimization to introduce small perturbations to clean data samples in a class, altering their feature representations to resemble those of clean data samples in another class. Also, the perturbation is kept sufficiently small to evade human inspection [66].

We extend the concept of “guided perturbation” to build Nightshade’s poison data. Given the generated images of A, hereby referred to as “anchor images,” our goal is to build effective poison images that look visually identical to natural images of C. Let $t$ be a chosen poison text prompt, and $x_t$ be the natural, clean image that aligns³ with $t$. Let $x_a$ be one of the anchor images. The optimization to find the poison image for $t$, or $x_t^p = x_t + \delta$, is defined by

$$\min_{\delta} \; \mathrm{Dist}\big(F(x_t + \delta),\, F(x_a)\big), \quad \text{subject to } |\delta| < p \tag{1}$$

where $F(\cdot)$ is the image feature extractor of the text-to-image model that the attacker has access to, $\mathrm{Dist}(\cdot)$ is a distance

³ Note that in our attack implementation, we select poison text prompts from a natural dataset of text/image pairs. Thus given $t$, we locate $x_t$ easily.

Figure 5. An illustrative example of Nightshade’s curation of poison data to attack the concept “dog” using “cat”. The anchor images (right) are generated by prompting “a photo of cat” on the clean SD-XL model multiple times. The poison images (middle) are perturbed versions of natural images of “dog”, which resemble the anchor images in feature representation.

Figure 6. Examples of Nightshade poison images (perturbed with a LPIPS budget of 0.07) and their corresponding original clean images.

Step 3: Constructing poison images $\{\text{Image}_p\}$.
For each text prompt $t \in \{\text{Text}_p\}$, locate its natural image pair $x_t$ in $\{\text{Image}\}$. Choose an anchor image $x_a$ from $\{\text{Image}_{\text{anchor}}\}$. Given $x_t$ and $x_a$, run the optimization of eq. (1) to produce a perturbed version $x_t' = x_t + \delta$, subject to $|\delta| < p$. Like [19], we use LPIPS [96] to bound the perturbation and apply the penalty method [46] to solve the optimization:

$$\min_{\delta} \; \|F(x_t + \delta) - F(x_a)\|_2^2 + \alpha \cdot \max\big(\mathrm{LPIPS}(\delta) - p,\, 0\big). \tag{2}$$
δ
function in the feature space, |δ| is the perceptual perturbation Next, add the text/image pair t/xt′ into the poison dataset
added to xt , and p is the perceptual perturbation budget. Here {Textp /Imagep }, remove xa from the anchor set, and move
we utilize the transferability between diffusion models [5, 66] to the next text prompt in {Textp }.
to optimize the poison image.
Figure 5 provides an illustrative example of the poison data 6 Evaluation
curated to corrupt the concept “dog” (C ) using “cat” (as A ).
In this section, we evaluate the efficacy of Nightshade attacks
under a variety of settings and attack scenarios, as well as
5.3 Detailed Attack Design other properties including bleed through to related concepts,
We now present the detailed algorithm of Nightshade to composability of attacks, and attack generalizability.
curate poison data that disrupts C . The algorithm outputs
{Textp /Imagep }, a collection of N p poison text/image pairs. 6.1 Experimental Setup
It uses the following resources and parameters: Models and Training Configuration. We consider two
• {Text/Image}: a collection of N natural (and aligned) scenarios: training from scratch and continuously updating
text/image pairs related to C , where N >> N p ; an existing model with new data (see Table 2).
• A : a concept that is semantically unrelated to C ; • Training from scratch (LD-CC): We train a latent diffusion
• M: an open-source text-to-image generative model; (LD) model [56] from scratch using the Conceptual Cap-
• Mtext : the text encoder of M; tion (CC) dataset [71] which includes over 3.3M image/text
• p: a small perturbation budget. pairs. We follow the exact training configuration of [56] and
train LD models on 1M samples uniformed sampled from
Step 1: Selecting poison text prompts {Textp }. CC. The clean model performs comparably (FID=17.5) to
Examine the text prompts in {Text}, find the set of high- a version trained on the full CC data (FID=16.8). As noted
activation text prompts of C . Specifically, ∀t ∈ {Text}, use in §4, training each model takes 8 days on an NVidia A100
the text encoder Mtext to compute the cosine similarity of t GPU.
and C in the semantic space: CosineSim (Mtext (t), Mtext (C )). • Continuous training (SD-V2, SD-XL, DF): Here the model
Find 5K top ranked prompts in this metric and randomly trainer continuously updates a pretrained model on new train-
sample N p text prompts to form {Textp }. The use of random ing data. We consider three state-of-the-art open source mod-
sampling is to prevent defenders from repeating the attack. els: Stable Diffusion V2 [76], Stable Diffusion XL [49], and
Step 2: Generating anchor images based on A . DeepFloyd [77]. They have distinct model architectures and
Query the available generator M with “a photo of {A }” if A use different pre-train datasets (details in Appendix A.1). We
is an object, and “a painting in style of {A }” if A is a style, randomly select 100K samples from the LAION-5B dataset
to generate a set of N p anchor images {Imageanchor }. as new data to update the models.
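As an illustration of Step 1 in §5.3 (selecting poison text prompts), the ranking and sampling can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the attack implementation: `toy_encode` is a hypothetical hashed bag-of-words stand-in for the text encoder M_text, and the names `select_poison_prompts`, `top_k`, and `n_p` are chosen here for illustration only.

```python
import math
import random


def toy_encode(text, dim=64):
    """Hypothetical stand-in for M_text: a hashed bag-of-words vector.

    A real attack would use the text encoder of the generative model.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    return vec


def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_poison_prompts(prompts, concept, encode, top_k, n_p, seed=0):
    """Rank prompts by similarity to the concept, keep the top_k,
    then randomly sample n_p of them (Step 1 uses top_k = 5K)."""
    c_vec = encode(concept)
    ranked = sorted(prompts, key=lambda t: cosine_sim(encode(t), c_vec),
                    reverse=True)
    candidates = ranked[:top_k]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_p, len(candidates)))


if __name__ == "__main__":
    demo_prompts = ["a photo of a dog", "dog running on grass", "a red car",
                    "city skyline at night", "my dog"]
    print(select_poison_prompts(demo_prompts, "dog", toy_encode,
                                top_k=3, n_p=2))
```

The random draw from the top-ranked candidates mirrors the stated purpose of the sampling step: two runs with different seeds produce different prompt sets even for the same concept.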

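The penalty-method objective of eq. (2) in Step 3 of §5.3 can be illustrated with a toy Python sketch. It rests on strong simplifying assumptions that are not part of the attack implementation: a linear map W stands in for the feature extractor F, the Euclidean norm of δ stands in for the LPIPS perceptual distance, and plain gradient descent replaces the Adam optimizer. The function names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def optimize_poison(x_t, x_a, W, p=1.0, alpha=10.0, lr=0.01, steps=500):
    """Minimize ||W(x_t + d) - W x_a||^2 + alpha * max(||d|| - p, 0) over d.

    W is an assumed linear feature extractor; ||d|| is an assumed stand-in
    for LPIPS. Returns the perturbed input and the perturbation d.
    """
    f_a = matvec(W, x_a)                      # anchor features F(x_a)
    W_T = [list(col) for col in zip(*W)]      # transpose, for the gradient
    delta = [0.0] * len(x_t)
    for _ in range(steps):
        x_p = [xi + di for xi, di in zip(x_t, delta)]
        residual = [fi - ai for fi, ai in zip(matvec(W, x_p), f_a)]
        # Gradient of the squared feature distance: 2 * W^T * residual.
        grad = [2.0 * g for g in matvec(W_T, residual)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > p:
            # The penalty term is active only once the budget p is exceeded.
            grad = [g + alpha * d / norm for g, d in zip(grad, delta)]
        delta = [d - lr * g for d, g in zip(delta, grad)]
    return [xi + di for xi, di in zip(x_t, delta)], delta


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    poisoned, delta = optimize_poison([0.0] * 4, [3.0, 0.0, 0.0, 0.0], identity)
    print(poisoned, math.sqrt(sum(d * d for d in delta)))
```

Because the penalty is soft, the perturbation can slightly overshoot the budget p; a larger α pulls it back more strongly, trading feature similarity for a smaller perturbation.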
[Figure 7 image grid. Columns, labeled by poisoned concept C: Dog, Car, Handbag, Hat, Fantasy art, Cubism, Cartoon, Concept Art. Destination concepts A, listed in the same column order: Cat, Cow, Toaster, Cake, Pointillism, Anime, Impressionism, Abstract. Rows: clean model (SD-XL), then poisoned SD-XL models with 50, 100, and 300 poison samples.]
Figure 7. Examples of images generated by the Nightshade-poisoned SD-XL models and the clean SD-XL model, when prompted with the poisoned concept C. We illustrate 8 values of C (4 in objects and 4 in styles), together with their destination concept A used by Nightshade.

Training Scenario     Model Name   Pretrain Dataset (# of pretrain data)   # of Clean Training Data
Train from scratch    LD-CC        -                                       1M
Continuous training   SD-V2        LAION (∼600M)                           100K
Continuous training   SD-XL        Internal Data (>600M)                   100K
Continuous training   DF           LAION (∼600M)                           100K
Table 2. Text-to-image models and training configurations.

Concepts. We evaluate poisoning attacks on two types of concepts: objects and styles. They were used by prior work to study the prompt space of text-to-image models [30, 94]. For objects, we use all 91 objects from the MSCOCO dataset [37], e.g., “dog”, “cat”, “boat”, “car”. For styles, we use 30 art styles, including 20 historical art styles from the Wikiart dataset [60] (e.g., “impressionism” and “cubism”) and 10 digital art styles from [33] (e.g., “anime”, “fantasy”). These concepts are all mutually semantically distinct.

Nightshade Attack Configuration. Following the attack design in §5.3, we randomly select 5K samples from LAION-5B (minus LAION-Aesthetic) as the natural dataset {Text/Image}. We ensure they do not overlap with the 100K training samples in Table 2. These samples are unlikely to be present in the pretrain datasets, which are primarily from LAION-Aesthetic. When attacking a concept C, we randomly choose the destination concept A from the concept list (in the same object/style category). For guided perturbation, we follow prior work to use an LPIPS budget of p = 0.07 and run an Adam optimizer for 500 steps [19, 69]. On average, it takes 94 seconds to generate a poison image on an NVidia Titan RTX GPU. Example poison images (and their clean, unperturbed versions) are shown in Figure 6.

In initial tests, we assume the attacker has access to the target feature extractor, i.e. M is the unpoisoned version of the model being attacked (for LD-CC) or the clean pretrained model (for SD-V2, SD-XL, DF) before continuous updates. Later in §6.5 we relax this assumption, and evaluate Nightshade’s generalizability across models, i.e. when M differs from the model under attack. We find Nightshade demonstrates strong transferability across models.

Evaluation Metrics. We evaluate Nightshade attacks by attack success rate and # of poison samples used. We measure attack success rate by testing the poisoned model’s ability to generate images of concept C. By default, we prompt the poisoned model with “a photo of C” or “a painting in C style” to generate 1000 images with varying random seeds. We also experiment with more diverse and complex prompts in §6.5 and produce qualitatively similar results. We measure the “correctness” of these 1000 images using two metrics:
• Attack Success Rate by CLIP Classifier: We apply a zero-shot CLIP classifier [51] to label the object/style of the images as one of the 91 objects/30 styles. We calculate attack success rate as % of generated images classified to a concept different from C. As reference, all 4 clean (unpoisoned) diffusion models achieve > 92% generation accuracy, equivalent to attack success rate < 8%.
• Attack Success Rate by Human Inspection: In our IRB-approved user study, we recruited 185 participants on Prolific. We gave each participant 20 randomly selected images and asked them to rate how accurately the prompt of C describes the image, on a 5-point Likert scale (from “not accurate at all” to “very accurate”). We measure attack success

rate by the % of images rated as “not accurate at all” or “not very accurate.”

6.2 Attack Effectiveness

Nightshade attacks succeed with little poison data. Nightshade successfully attacks all four diffusion models with minimal (≈100) poison samples, less than 20% of that required by the simple attack. Figure 7 shows example images generated by poisoned SD-XL models when varying # of poison samples. With 100+ poison samples, generated images (when prompted by the poisoned concept C) illustrate the destination concept A, confirming the success of Nightshade attacks. To be more specific, Figures 8-11 plot attack success rate for all four models, measured using the CLIP classifier or by human inspection, as a function of # of poison samples used. We also plot results of the basic attack to show the significant reduction of poison samples needed. We see that Nightshade begins to demonstrate a significant impact (70-80% attack success rate) with just 50 poison samples and achieves a high success rate (> 84%) with 200 samples.

Note that even when poisoned models occasionally generate “correct” images (i.e., classified as concept C), they are often incoherent, e.g., the 6-leg “dog” and the strange “car” in the second row of Figure 7. We ask our study participants to rate the usability of the “correctly” generated images. Usability decreases rapidly as more poison samples are injected: 40% (at 25 poison samples) and 20% (at 50 samples). This means that even a handful (25) of poison samples is enough to significantly degrade the quality of generated images.

Visualizing changes in model internals. Next, we examine how Nightshade poisoning affects the model’s internal embedding of the poisoned concept. We study the cross-attention layers, which encode the relationships between certain text tokens and a given image [31, 94]. Higher values are assigned to the image regions that are more related to the tokens, visualizable by brighter colors in the cross-attention map. Figure 12 plots the cross-attention maps of a model before and after poisoning (SD-V2 with 200 poison data) for two object concepts targeted by Nightshade (“hat” and “handbag”). The object shape is clearly highlighted by the clean model map, but has clearly changed to the destination concept (“banana” and “fork”) once the model is poisoned.

Impact of adding clean data from related concepts. Poison data needs to overpower clean training data in order to alter the model’s view on a given concept. Thus, increasing the amount of clean data related to a concept C (e.g., clean data of “dog” and its synonyms) will make poisoning attacks on C more challenging. We measure this impact on LD-CC by adding clean samples from LAION-5B. Figure 13 shows that the number of poison samples needed for successful attacks (i.e., > 90% CLIP attack success rate) increases linearly with the amount of clean training data. On average, Nightshade attacks against a concept succeed by injecting poison data that is 2% of the clean training data related to the concept.

6.3 Bleed-through to Other Concepts

Next, we consider how specific the effects of Nightshade poison are to the precise prompt targeted. If the poison is only associated with a specific term, then it can be easily bypassed by prompt rewording, e.g. automatically replacing the poisoned term “dog” with “big puppy.” Instead, we find that these attacks exhibit a “bleed-through” effect. Poisoning concept C has a noticeable impact on related concepts, i.e., poisoning “dog” also corrupts the model’s ability to generate “puppy” or “husky.” Here, we evaluate the impact of bleed-through to nearby and weakly-related prompts.

Bleed-through to nearby concepts. We first look at how poison data impacts concepts that are close to C in the model’s text embedding space. For a poisoned concept C (e.g., “dog”), these “nearby concepts” are often synonyms (e.g., “puppy”, “hound”, “husky”) or alternative representations (e.g., “canine”). Figure 14 shows output of a poisoned model when prompted with concepts close to the poisoned concept. Nearby, untargeted concepts are significantly impacted by poisoning. Table 3 shows that a nearby concept’s CLIP attack success rate decreases as concepts move further from C. Bleed-through strength is also impacted by the number of poison samples (when 3.0 < D ≤ 6.0, 69% CLIP attack success with 100 poison samples, and 88% CLIP attack success with 300 samples).

L2 Distance to          Average Number of     Average CLIP attack success rate
poisoned concept (D)    Concepts Included     100 poison   200 poison   300 poison
D = 0                   1                     85%          96%          97%
0 < D ≤ 3.0             5                     76%          94%          96%
3.0 < D ≤ 6.0           13                    69%          79%          88%
6.0 < D ≤ 9.0           52                    22%          36%          55%
D > 9.0                 1929                  5%           5%           6%
Table 3. Poison attack bleed-through to nearby concepts. The CLIP attack success rate decreases (weaker bleed-through effect) as the L2 distance between the nearby concept and the poisoned concept increases. A model poisoned with a higher number of poison data has a stronger impact on nearby concepts. (SD-XL)

Bleed-through to related prompts. Next, we look at more complex relationships between the text prompts and the poisoned concept. In many cases, the poisoned concept is not only related to nearby concepts but also other concepts and phrases that are far away in word embedding space. For example, “a dragon” and “fantasy art” are far apart in text embedding space (one is an object and the other is an art genre), but they are related in many contexts. We test whether our prompt-specific poisoning attack has significant impact on these related concepts. Figure 15 shows images generated by querying a set of related concepts on a model poisoned for concept C “fantasy art.” We can observe that related phrases such as “a painting by Michael Whelan” (a famous fantasy artist) are also successfully poisoned, even when the text prompt does not mention “fantasy art” or nearby concepts. On the right side of Figure 15, we show that unrelated concepts (e.g., Van Gogh style) are not impacted.

We have further results on understanding bleed-through

effects between artists and art styles, as well as techniques to amplify the bleed-through effect to expand the impact of poison attacks. Those details are available in Appendix A.4.

[Figures 8-11: line plots of attack success rate (y-axis, 0 to 1; CLIP-based in Figures 8 and 10, human-rated in Figures 9 and 11) vs. number of poison data injected (x-axis, 0 to 300). Curves: Nightshade attack and simple attack for LD-CC; SD-V2, SD-XL, DF, and simple attack for continuous training.]
Figure 8. Nightshade’s attack success rate (CLIP-based) vs. # of poison samples injected, for LD-CC (train-from-scratch). The result of the simple attack is provided for comparison.
Figure 9. Nightshade’s attack success rate (Human-rated) vs. # of poison samples injected, for LD-CC (train-from-scratch).
Figure 10. Nightshade’s attack success rate (CLIP-based) vs. # of poison samples injected, for SD-V2, SD-XL, DF (continuous training). The result of simple attack (best of 3) is provided for comparison.
Figure 11. Nightshade’s attack success rate (Human-rated) vs. # of poison samples injected, for SD-V2, SD-XL, DF (continuous training).

[Figure 12: cross-attention maps of concept C for the clean model and the poisoned model, for C: Hat (A: Banana) and C: Handbag (A: Fork).]
Figure 12. Cross-attention maps of a model before and after poisoning. Poisoned model highlights destination A (banana, fork) instead of concept C (hat, handbag).

[Figure 13: # of necessary poison samples (y-axis, 0 to 2000) vs. # of clean data related to C (x-axis, 5K to 100K), with fitted line y = 0.02 * x.]
Figure 13. Poison samples needed to achieve 90% attack success vs. # of clean samples semantically related to target concept C (LD-CC).

[Figure 14 image grid. Columns: poisoned concept Dog; nearby concepts (not targeted) Puppy (L2 = 1.9), Husky (L2 = 3.5), Wolf (L2 = 6.2), where L2 is the distance to the poisoned concept. Rows: clean model, poisoned model.]
Figure 14. Images generated from different prompts by a poisoned model where concept “dog” is poisoned. Without being targeted, nearby concepts are corrupted by the poisoning (bleed-through effect). SD-XL model poisoned with 200 poison samples.

Approach                            # of poisoned concepts   Alignment Score (higher better)   FID (lower better)
Clean SD-XL                         0                        0.33                              15.0
Poisoned SD-XL                      100                      0.27                              28.5
Poisoned SD-XL                      250                      0.24                              39.6
Poisoned SD-XL                      500                      0.21                              47.4
AttnGAN                             -                        0.26                              35.5
A model that outputs random noise   -                        0.20                              49.4
Table 4. Overall performance of the model (CLIP alignment score and FID) when an increasing number of concepts are poisoned. We also show baseline performance of a GAN model from 2017 and a model that outputs random Gaussian noise.

6.4 Stacking Multiple Attacks

Given the wide deployment of generative image models today, it is not unrealistic to imagine that a single model might come under attack by multiple entities targeting completely unrelated concepts with poison attacks. Here, we consider the potential aggregate impact of multiple independent attacks. First, we show results on composability of poison attacks. Second, we show a surprising result: a sufficient number of attacks can actually destabilize the entire model, effectively disabling the model’s ability to generate responses to completely unrelated prompts.

Poison attacks are composable. Given our discussion on model sparsity (§3.2), it is not surprising that multiple poison attacks targeting different poisoned concepts can coexist in a model without interference. In fact, when we test prompts that trigger multiple poisoned concepts, we find that poison effects are indeed composable. Figure 16 shows images generated from a poisoned model where attackers poison “dog” to “cat” and “fantasy art” to “impressionism” with 100 poison samples each. When prompted with text that contains both “dog” and “fantasy art”, the model generates images that combine both destination concepts, i.e. a cat in an impressionism-like style.

Multiple attacks damage the entire model. Today’s text-to-image diffusion models rely on a hierarchical or stepwise approach to generate high quality images [54, 56, 77, 81], where the model often first generates higher level coarse features (e.g., a medium size animal) and then refines them slowly into high quality images of specific content (e.g., a dog). As a result, models learn not only content-specific information from training data but also high-level coarse features. Poison data targeting specific concepts might have lasting impact on

these high level coarse features, e.g., poisoning fantasy art will slightly degrade the model’s performance on all artwork. Thus it is possible that a sufficient number of attacks can significantly degrade a model’s overall performance.

[Figure 15 image grid. Poisoned concept: “Fantasy art”. Related prompts: “A painting by Michael Whelan”, “A dragon”, “A castle in the Lord of the Rings”. Unrelated prompts (control group): “A chair”, “A castle”, “A painting by Van Gogh”. Rows: clean model, poisoned model.]
Figure 15. Images generated from different prompts by a poisoned model where concept “fantasy art” is poisoned. Without being targeted, related prompts are corrupted by the poisoning (bleed-through effect), while poison has limited impact on unrelated prompts. SD-XL model poisoned with 200 poison samples.

[Figure 16 image: Poisoned #1: Dog to Cat; Poisoned #2: Fantasy art to Impressionism; output image for the prompt “A dog in fantasy art style”.]
Figure 16. Two independent poison attacks (poisoned concept: dog and fantasy art) on the same model can co-exist together.

[Figure 17 image grid. Columns: # of poisoned concepts (0, 100, 250, 500). Rows: prompts “A person”, “A painting”, “A seashell”.]
Figure 17. Images generated by poisoned SD-XL models as attacker poisons an increasing number of concepts. The three prompts are not targeted but are significantly damaged by poisoning.

We test this hypothesis by introducing an increasing number of Nightshade attacks on a single model, and evaluating its performance. We follow prior work on text-to-image generation [48, 54, 56, 57] and leverage two popular metrics to evaluate a generative model’s overall performance: 1) CLIP alignment score, which captures a generated image’s alignment to its prompt [51], and 2) FID score, which captures image quality [32]. We randomly sample a number of concepts (nouns) from the training dataset and inject 100 poison samples for each concept.

We find that as more concepts are poisoned, the model’s overall performance drops dramatically: the alignment score falls to 0.24 and FID rises to 39.6 when 250 different concepts are poisoned with 100 samples each. Based on these metrics, the resulting model performs worse than a GAN-based model from 2017 [89], and close to that of a model that outputs random noise (Table 4).

Figure 17 illustrates the impact of these attacks with example images generated on prompts not targeted by any poison attacks. We include two generic prompts (“a person” and “a painting”) and a rare prompt (“seashell”, which is far away from most other concepts in text embedding space; see Appendix Figure 18). Image quality starts to degrade noticeably with 250 concepts poisoned. When 500 to 1000 concepts are poisoned, the model generates what seems like random noise. For a model trained from scratch (LD-CC), similar levels of degradation require 500 concepts to be poisoned (Table 9 in Appendix). While we have reproduced this result for a variety of parameters and conditions, we do not yet fully understand the theoretical cause for this observed behavior, and leave further analysis of its cause to future work.

6.5 Attack Generalizability

Next, we consider attack generalizability, in terms of transferability to other models and applicability to complex prompts.

Attack transferability to different models. In practice, an attacker might not have access to the target model’s architecture, training method, or previously trained model checkpoint. Here, we evaluate our attack performance when the attacker and model trainer use different model architectures and/or different training data. We assume the attacker uses a clean model from one of our 4 models to construct poison data,

and applies it to a model using a different model architecture. Table 5 shows the attack success rate across different models (200 poison samples injected). When relying on transferability, the effectiveness of Nightshade poison attack drops but remains high (≥ 72% CLIP attack success rate). Attack transferability is significantly higher when the attacker uses SD-XL, likely because it has higher model performance and extracts more generalizable image features as observed in prior work [70, 87].

Attacker’s Model   Model Trainer’s Model
                   LD-CC   SD-V2   SD-XL   DF
LD-CC              96%     76%     72%     79%
SD-V2              87%     87%     81%     86%
SD-XL              89%     90%     91%     88%
DF                 87%     81%     80%     90%
Table 5. Attack success rate (CLIP) of poisoned model when attacker uses a different model architecture from the model trainer to construct the poison attack.

Attack performance on diverse prompts. So far, we have been mostly focusing on evaluating attack performance using generic prompts such as “a photo of C” or “a painting in C style.” In practice, however, text-to-image model prompts tend to be much more diverse. Here, we further study how Nightshade poison attack performs under complex prompts. Given a poisoned concept C, we follow prior work [57] to generate 4 types of complex prompts (examples shown in Table 6). More details on the prompt construction can be found in Section 4 of [57]. We summarize our results in Table 6. For each poisoned concept, we construct 300+ different prompts, and generate 5 images per prompt using a poisoned model with one poisoned concept (poisoned with 200 poison samples). We find that Nightshade is effective in different complex prompts (≥ 89% success rate for all 4 types).

Prompt Type             Example Prompt                 # of Prompts per Concept   Attack Success % (CLIP)
Default                 A photo of a [dog]             1                          91%
Recontextualization     A [dog] in Amazon rainforest   20                         90%
View Synthesis          Back view of a [dog]           4                          91%
Art renditions          A [dog] in style of Van Gogh   195                        90%
Property Modification   A blue [dog]                   100                        89%
Table 6. CLIP attack success rate of poisoned model when user prompts the poisoned model with different types of prompts that contain the poisoned concept. (SD-XL poisoned with 200 poison data)

7 Potential Defenses

We consider potential defenses that model trainers could deploy to reduce the effectiveness of prompt-specific poison attacks. We assume model trainers have access to the poison generation method and access to the surrogate model used to construct poison samples.

While many detection/defense methods have been proposed to detect poison in classifiers, recent work shows they are often unable to extend to or are ineffective in generative models (LLMs and multimodal models) [8, 83, 91]. Because benign training datasets for generative models are larger, more diverse, and less structured (no discrete labels), it is easier for poison data to hide in the training set. Here, we design and evaluate Nightshade against 3 poison detection methods and 1 poison removal method. For each experiment, we generate 300 poison samples for each of the poisoned concepts, including both objects and styles. We report both precision and recall for defenses that detect poison data, as well as impact on attack performance when the model trainer filters out any data detected as poison. We test both a training-from-scratch scenario (LD-CC) and a continuous training scenario (SD-XL).

Filtering high loss data. Poison data is designed to incur high loss during model training. Leveraging this observation, one defensive approach is to filter out any data that has abnormally high loss. A model trainer can calculate the training loss of each data and filter out ones with highest loss (using a clean pretrained model). We found this approach ineffective at detecting Nightshade poison data, achieving 73% precision and 47% recall with 10% FPR. Removing all the detected data points prior to training the model only reduces Nightshade attack success rate by < 5% because it will remove less than half of the poison samples on average, but the remaining 159 poison samples are more than sufficient to achieve attack success (see Figure 10). The low detection performance is because benign samples in large text/image datasets are often extremely diverse and noisy, and a significant portion of them produces high loss, leading to a high false positive rate of 10%. Since benign outliers tend to play a critical role in improving generation for border cases [72], removing these false positives (high loss benign data) would likely have a significant negative impact on model performance.

Frequency analysis. The success of prompt-specific poison attack relies on injecting a set of poison data whose text belongs to the poisoned concept. It is possible for model trainers to monitor the frequency of each concept and detect any abnormal change of data frequency in a specific concept. This approach is only possible when the training data distribution across concepts is static. This is often not true for real world datasets, as concept distribution in datasets depends on many factors, e.g., time (news cycles, trending topics) and location (country) of collection.

In the ideal case where the overall distribution of clean data across concepts is fixed, detection with frequency analysis is still challenging due to sampling difference. We assume that the LAION-5B dataset represents the distribution of clean data, and perform 2 independent random samples of 500K data from LAION-5B and repeat this process 10 times. Across these two samplings, an average of > 19.2% concepts have > 30% frequency differences. When injecting 300 poison data to poison a concept in the LD-CC model, Nightshade poison attack only incurs < 30% frequency changes to > 91% of the poisoned concepts, making it difficult to detect poisoned concepts without sacrificing performance for other concepts.

Image-text alignment filtering. Alignment filtering has

been used to detect poison data in generative models [91] and as a general way to filter out noisy data [14, 63, 64]. Alignment models [54] calculate the alignment (similarity) score between text/image pairs (as discussed in §6.4). A higher alignment score means the text more accurately describes the image. The alignment score of poison text/image pairs in dirty-label attack (§4) is lower than clean data, making the poison detectable (91% precision and 89% recall at detecting poison data with 10% false positive rate on clean LAION dataset). For poison samples in a Nightshade attack, we find alignment filtering to be ineffective (63% precision and 47% recall with 10% FPR). Removing detected samples also has limited impact on attack success (only decreases CLIP attack success rate by < 4%).

This result shows that the perturbations we optimized on poison images are able to perturb the image’s features in text-to-image models, but they have limited impact on the features extracted by alignment models. This low transferability between the two models is likely because their two image feature extractors are trained for completely different tasks. Alignment models are trained on text/image pairs to retrieve related text prompts from input images, and thus tend to focus more on high level features, whereas the text-to-image image extractor is trained to faithfully reconstruct original images, and might focus more on fine-grained detail features.

We note that it might be possible for model trainers to customize an alignment model to ensure high transferability with poison sample generation, thus making it more effective at detecting poison samples. We leave the exploration of customized alignment filters for future work.

Automated image captioning. Lastly, we look at a defense method where the model trainer completely removes the text prompt for all training data in order to remove the poison text. Once removed, the model trainer can leverage existing image captioning tools [36, 82] to generate new text prompts for each training image. Similar approaches have been used to improve the data quality of poorly captioned images [35, 45].

For a poisoned dataset, we generate image captions using the BLIP model [36] for all images, and train the model on generated text paired up with original images. We find that the image caption model often generates captions that contain the

8 Poison Attacks for Copyright Protection

Here, we discuss how Nightshade (or tools built upon similar techniques) can serve as a protection mechanism for intellectual property (IP), and a disincentive for model trainers who disregard opt-out and do-not-scrape/train notices.

Power Asymmetry. As model training has grown beyond a handful of AI companies, it is increasingly evident that there is significant power asymmetry in the tension between AI companies that build/train models, and content owners trying to protect their intellectual property. As legal cases and regulatory efforts move slowly forward, the only measures available to content owners are “voluntary” measures such as opt-out lists [88] and do-not-scrape/train directives [22] in robots.txt. Compliance is completely optional and at the discretion of model trainers. While larger companies have promised to respect robots.txt directives, smaller AI companies have no incentive to do so. Finally, there is no reliable way today to detect if and when these opt-outs or directives are violated, and thus no way to enforce or verify compliance.

Nightshade as Copyright Protection. In this context, Nightshade or similar techniques can provide a powerful incentive for model trainers to respect opt-outs and do-not-crawl directives. Any stakeholder interested in protecting their IP (movie studios, game developers, independent artists) can apply prompt-specific poisoning to their images, and (possibly) coordinate with other content owners on shared terms. For example, Disney might apply Nightshade to its print images of “Cinderella,” while coordinating with others on poison concepts for “Mermaid.”

Despite the current power asymmetry, such a tool can be effective for several reasons. First, an optimized attack like Nightshade can be successful with a small number of samples. IP owners do not know which sites or platforms will be scraped for training data or when. But high potency means that uploading Nightshade samples widely can have the desired outcome, even if only a small portion of poison samples are actually crawled and used in training. Second, current work on machine unlearning [12, 44] is limited in scalability and impractical at the scale of modern generative AI models. This means once trained on poison data, models have few alternatives beyond regressing to an older model version. Finally, while it is always possible in the future to develop detectors or antidotes for poison attacks like Nightshade, such
poisoned concept or related concepts given the Nightshade
defenses must be extremely time efficient. Processing hun-
poison images. Thus, the defense has limited effectiveness,
dreds of millions of training samples would be very costly
and has very low impact (< 6% CLIP attack success rate drop
unless the algorithm takes only a few seconds per image.
for both LD-CC and SD-XL) on our attack.
All these costs would be further compounded by the poten-
tial introduction of other Nightshade variants or other poison
This result is expected, as most image caption models today
attacks. Finally, even if Nightshade poison samples were de-
are built upon alignment models, which are unable to detect
tected efficiently (see discussion in §7), Nightshade would act
anomalies in poison data as discussed above. Here, the success
as proactive “do-not-train” filter that prevents models from
of this approach hinges on building a robust caption model
training on these samples.
that extracts correct text prompts from poisoned samples.
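As a rough illustration of how the alignment-filtering numbers in §7 are defined (a detection threshold calibrated to a 10% false positive rate on clean data, then precision and recall measured on poison samples), the following sketch may help. It is not the authors' implementation: the function name `filter_metrics` and the use of a simple score quantile as the threshold are assumptions, and the precision value depends on the clean/poison mix of whatever evaluation set is passed in.

```python
import numpy as np

def filter_metrics(clean_scores, poison_scores, target_fpr=0.10):
    """Evaluate a threshold-based alignment filter.

    A sample is flagged when its text/image alignment score falls below a
    threshold chosen so that roughly `target_fpr` of clean samples are flagged.
    (Hypothetical helper; the threshold rule is an assumption.)
    """
    clean = np.asarray(clean_scores, dtype=float)
    poison = np.asarray(poison_scores, dtype=float)

    # Calibrate the threshold on clean data only.
    threshold = np.quantile(clean, target_fpr)

    false_pos = int(np.sum(clean < threshold))   # clean samples wrongly flagged
    true_pos = int(np.sum(poison < threshold))   # poison samples correctly flagged
    flagged = true_pos + false_pos

    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / poison.size
    fpr = false_pos / clean.size
    return threshold, precision, recall, fpr
```

Low recall at a fixed false positive rate, as reported above, corresponds to most poison scores lying above the calibrated threshold, i.e., inside the range of scores that clean samples also produce.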

9 Conclusion

This work introduces the conceptual design, implementation and experimental evaluation of prompt-specific poison attacks on text-to-image generative image models. We believe our exploration of these issues sheds light on fundamental limitations of these models. Moving forward, it is possible poison attacks may have potential value as tools to encourage model trainers and content owners to negotiate a path towards licensed procurement of training data for future models.

References

[1] Midjourney user prompts & generated images (250k).
[2] Create logos, videos, banners, mockups with a.i. in 2 minutes. designs.ai, 2023.
[3] 3DLOOK. Virtual try-on for clothing: The future of fashion? 3dlook.ai, 2023.
[4] Adobe max conference, Oct. 2023. Los Angeles, CA.
[5] AGHAKHANI, H., MENG, D., WANG, Y.-X., KRUEGEL, C., AND VIGNA, G. Bullseye polytope: A scalable clean-label poisoning attack with improved transferability. arXiv preprint arXiv:2005.00191 (2020).
[6] ANDERSEN, S. The Alt-Right Manipulated My Comic. Then A.I. Claimed It., 2022.
[7] BAGDASARYAN, E., AND SHMATIKOV, V. Blind backdoors in deep learning models. In Proc. of USENIX Security (2021), pp. 1505–1521.
[8] BAGDASARYAN, E., AND SHMATIKOV, V. Spinning language models: Risks of propaganda-as-a-service and countermeasures. In Proc. of IEEE S&P (2022).
[9] BAIO, A. Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model, 2022.
[10] BIGGIO, B., NELSON, B., AND LASKOV, P. Support vector machines under adversarial label noise. In Proc. of ACML (2011).
[11] BOND, F., AND PAIK, K. A survey of wordnets and their licenses. In Proc. of GWC (2012).
[12] BOURTOULE, L., ET AL. Machine unlearning. In Proc. of IEEE S&P (2021).
[13] CARLINI, N., ET AL. Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149 (2023).
[14] CHANGPINYO, S., SHARMA, P., DING, N., AND SORICUT, R. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proc. of CVPR (2021).
[15] CHEN, B., ET AL. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728 (2018).
[16] CHEN, H., FU, C., ZHAO, J., AND KOUSHANFAR, F. Deepinspect: A black-box trojan detection and mitigation framework for deep neural networks. In IJCAI (2019).
[17] CHEN, W., SONG, D., AND LI, B. Trojdiff: Trojan attacks on diffusion models with diverse targets. In Proc. of CVPR (2023), pp. 4035–4044.
[18] CHEN, X., ET AL. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proc. of ACSAC (2021), pp. 554–569.
[19] CHEREPANOVA, V., ET AL. Lowkey: Leveraging adversarial attacks to protect social media users from facial recognition. arXiv preprint arXiv:2101.07922 (2021).
[20] CHOU, S.-Y., CHEN, P.-Y., AND HO, T.-Y. How to backdoor diffusion models? In Proc. of CVPR (2023), pp. 4015–4024.
[21] CIVITAI. What the heck is Civitai?, 2022. https://civitai.com/content/guides/what-is-civitai.
[22] DAVID, E. Now you can block openai’s web crawler. TheVerge, August 2023.
[23] DING, M., ET AL. Cogview: Mastering text-to-image generation via transformers. Proc. of NeurIPS (2021).
[24] EYKHOLT, K., ET AL. Robust physical-world attacks on deep learning visual classification. In Proc. of CVPR (2018), pp. 1625–1634.
[25] FELLBAUM, C. Wordnet and wordnets. encyclopedia of language and linguistics, 2005.
[26] GAL, R., ET AL. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022).

[27] GEIPING, J., ET AL. What doesn’t kill you makes you robust (er): Adversarial training against poisons and backdoors. arXiv preprint arXiv:2102.13624 (2021).
[28] GOLDBLUM, M., ET AL. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Trans. Pattern Anal. Mach. Intell (2022).
[29] GU, T., DOLAN-GAVITT, B., AND GARG, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. In Proc. of MLCS Workshop (2017).
[30] HE, X., ZANNETTOU, S., SHEN, Y., AND ZHANG, Y. You only prompt once: On the capabilities of prompt learning on large language models to tackle toxic content. arXiv preprint arXiv:2308.05596 (2023).
[31] HERTZ, A., ET AL. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022).
[32] HEUSEL, M., ET AL. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Proc. of NeurIPS (2017).
[33] HOARE, A. Digital Illustration Styles, 2021. https://www.theillustrators.com.au/digital-illustration-styles.
[34] JIA, J., CAO, X., AND GONG, N. Z. Intrinsic certified robustness of bagging against data poisoning attacks. In Proc. of AAAI (2021).
[35] LEE, C., JANG, J., AND LEE, J. Personalizing text-to-image generation with visual prompts using blip-2. In Proc. of ICML (2023).
[36] LI, J., LI, D., XIONG, C., AND HOI, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proc. of ICML (2022).
[37] LIN, T.-Y., ET AL. Microsoft coco: Common objects in context. In Proc. of ECCV (2014), Springer, pp. 740–755.
[38] LIU, H., QU, W., JIA, J., AND GONG, N. Z. Pre-trained encoders in self-supervised learning improve secure and privacy-preserving supervised learning. arXiv preprint arXiv:2212.03334 (2022).
[39] LIU, X., LI, F., WEN, B., AND LI, Q. Removing backdoor-based watermarks in neural networks with limited data. In Proc. of ICPR (2021).
[40] LIU, Y., ET AL. Trojaning attack on neural networks. In Proc. of NDSS (2018).
[41] LU, Y., KAMATH, G., AND YU, Y. Indiscriminate data poisoning attacks on neural networks. arXiv preprint arXiv:2204.09092 (2022).
[42] MORRIS, C. 7 best ai website builders in 2023 (for fast web design). elegantthemes.com, Sep 2023.
[43] MURPHY, B. P. Is Lensa AI Stealing From Human Art? An Expert Explains the Controversy, 2022.
[44] NEEL, S., ROTH, A., AND SHARIFI-MALVAJERDI, S. Descent-to-delete: Gradient-based methods for machine unlearning. In Proc. of ALT (2021).
[45] NGUYEN, T., GADRE, S. Y., ILHARCO, G., OH, S., AND SCHMIDT, L. Improving multimodal datasets with image captioning. arXiv preprint arXiv:2307.10350 (2023).
[46] NOCEDAL, J., AND WRIGHT, S. Numerical optimization, series in operations research and financial engineering. Springer, New York, USA, 2006 (2006).
[47] NOVELAI. NovelAI changelog, 2022. https://novelai.net/updates.
[48] PARK, D. H., AZADI, S., LIU, X., DARRELL, T., AND ROHRBACH, A. Benchmark for compositional text-to-image synthesis. In Proc. of NeurIPS (2021).
[49] PODELL, D., ET AL. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023).
[50] QIAO, X., YANG, Y., AND LI, H. Defending neural backdoors via generative distribution modeling. Proc. of NeurIPS (2019).
[51] RADFORD, A., KIM, J. W., HALLACY, C., RAMESH, A., GOH, G., AGARWAL, S., SASTRY, G., ASKELL, A., MISHKIN, P., CLARK, J., ET AL. Learning transferable visual models from natural language supervision. In Proc. of ICML (2021).
[52] RADFORD, A., METZ, L., AND CHINTALA, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).
[53] RAMESH, A., ET AL. Zero-shot text-to-image generation. In Proc. of ICML (2021).
[54] RAMESH, A., ET AL. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
[55] RINCON, L. Virtually try on clothes with a new ai shopping feature. Google, Jun 2023.
[56] ROMBACH, R., BLATTMANN, A., LORENZ, D., ESSER, P., AND OMMER, B. High-resolution image synthesis with latent diffusion models. In Proc. of CVPR (2022), pp. 10684–10695.
[57] RUIZ, N., ET AL. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proc. of CVPR (2023).
[58] S., M. How to use ai image generator to make custom images for your site in 2023. hostinger.com, Sep 2023.
[59] SAHA, A., SUBRAMANYA, A., AND PIRSIAVASH, H. Hidden trigger backdoor attacks. In Proc. of AAAI (2020), no. 07.
[60] SALEH, B., AND ELGAMMAL, A. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. arXiv preprint arXiv:1505.00855 (2015).
[61] SCENARIO.GG. AI-generated game assets, 2022. https://www.scenario.gg/.
[62] SCHUHMANN, C. Laion-aesthetics. LAION.AI, Aug 2022.
[63] SCHUHMANN, C., ET AL. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021).
[64] SCHUHMANN, C., ET AL. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402 (2022).
[65] SCHUSTER, R., SONG, C., TROMER, E., AND SHMATIKOV, V. You autocomplete me: Poisoning vulnerabilities in neural code completion. In Proc. of USENIX Security (2021).
[66] SCHWARZSCHILD, A., ET AL. Just how toxic is data poisoning? a unified benchmark for backdoor and data poisoning attacks. In Proc. of ICML (2021), PMLR, pp. 9389–9398.
[67] SEVERI, G., MEYER, J., COULL, S., AND OPREA, A. Explanation-guided backdoor poisoning attacks against malware classifiers. In Proc. of USENIX Security (2021).
[68] SHAFAHI, A., ET AL. Poison frogs! targeted clean-label poisoning attacks on neural networks. arXiv preprint arXiv:1804.00792 (2018).
[69] SHAN, S., CRYAN, J., WENGER, E., ZHENG, H., HANOCKA, R., AND ZHAO, B. Y. Glaze: Protecting artists from style mimicry by text-to-image models. In Proc. of USENIX Security (2023).
[70] SHAN, S., WENGER, E., ZHANG, J., LI, H., ZHENG, H., AND ZHAO, B. Y. Fawkes: Protecting privacy against unauthorized deep learning models. In Proc. of USENIX Security (2020).
[71] SHARMA, P., DING, N., GOODMAN, S., AND SORICUT, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proc. of ACL (2018).
[72] SHUMAILOV, I., ET AL. The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493 (2023).
[73] SOHL-DICKSTEIN, J., WEISS, E., MAHESWARANATHAN, N., AND GANGULI, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. of ICML (2015).

[74] STABILITY AI. Stable Diffusion 2.0 Release, 2022. https://stability.ai/blog/stable-diffusion-v2-release.
[75] STABILITY AI. Stable Diffusion Public Release, 2022. https://stability.ai/blog/stable-diffusion-public-release.
[76] STABILITY AI. Stable Diffusion v2.1 and DreamStudio Updates 7-Dec 22, 2022. https://stability.ai/blog/stablediffusion2-1-release7-dec-2022.
[77] STABILITY AI. Stability AI releases DeepFloyd IF, a powerful text-to-image model that can smartly integrate text into images, 2023. https://stability.ai/blog/deepfloyd-if-text-to-image-model.
[78] STABILITYAI. Stable Diffusion v1-4 Model Card, 2022. https://huggingface.co/CompVis/stable-diffusion-v1-4.
[79] TRAN, T. H. Image Apps Like Lensa AI Are Sweeping the Internet, and Stealing From Artists, 2022. https://www.thedailybeast.com/how-lensa-ai-and-image-generators-steal-from-artists.
[80] TURNER, A., TSIPRAS, D., AND MADRY, A. Clean-label backdoor attacks.
[81] VAHDAT, A., KREIS, K., AND KAUTZ, J. Score-based generative modeling in latent space. Proc. of NeurIPS (2021).
[82] VINYALS, O., TOSHEV, A., BENGIO, S., AND ERHAN, D. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555 (2014).
[83] WAN, A., WALLACE, E., SHEN, S., AND KLEIN, D. Poisoning language models during instruction tuning. arXiv preprint arXiv:2305.00944 (2023).
[84] WANG, B., CAO, X., GONG, N. Z., ET AL. On certifying robustness against backdoor attacks via randomized smoothing. arXiv preprint arXiv:2002.11750 (2020).
[85] WANG, B., YAO, Y., SHAN, S., LI, H., VISWANATH, B., ZHENG, H., AND ZHAO, B. Y. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proc. of IEEE S&P (2019), IEEE, pp. 707–723.
[86] WENGER, E., PASSANANTI, J., BHAGOJI, A., YAO, Y., ZHENG, H., AND ZHAO, B. Y. Backdoor attacks against deep learning systems in the physical world. In Proc. of CVPR (2021).
[87] WU, L., ET AL. Understanding and enhancing the transferability of adversarial examples. arXiv preprint arXiv:1802.09707 (2018).
[88] XIANG, C. Ai is probably using your images and it’s not easy to opt out. Motherboard, Tech by Vice, Sept 2022.
[89] XU, T., ZHANG, P., HUANG, Q., ZHANG, H., GAN, Z., HUANG, X., AND HE, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proc. of CVPR (2018), pp. 1316–1324.
[90] YANG, S. Why Artists are Fed Up with AI Art., 2022.
[91] YANG, Z., HE, X., LI, Z., BACKES, M., HUMBERT, M., BERRANG, P., AND ZHANG, Y. Data poisoning attacks against multimodal encoders. In Proc. of ICML (2023).
[92] YAO, Y., LI, H., ZHENG, H., AND ZHAO, B. Y. Latent backdoor attacks on deep neural networks. In Proc. of CCS (2019), pp. 2041–2055.
[93] ZHAI, S., ET AL. Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. arXiv preprint arXiv:2305.04175 (2023).
[94] ZHANG, E., ET AL. Forget-me-not: Learning to forget in text-to-image diffusion models. arXiv preprint arXiv:2303.17591 (2023).
[95] ZHANG, J., LIU, H., JIA, J., AND GONG, N. Z. Corruptencoder: Data poisoning based backdoor attacks to contrastive learning. arXiv preprint arXiv:2211.08229 (2022).
[96] ZHANG, R., ISOLA, P., EFROS, A. A., SHECHTMAN, E., AND WANG, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. of CVPR (2018), pp. 586–595.
[97] ZHU, C., HUANG, W. R., LI, H., TAYLOR, G., STUDER, C., AND GOLDSTEIN, T. Transferable clean-label poisoning attacks on deep neural nets. In Proc. of ICML (2019).
[98] ZHU, M., PAN, P., CHEN, W., AND YANG, Y. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proc. of CVPR (2019), pp. 5802–5810.

A Appendix

A.1 Experiment Setup

In this section, we detail our experimental setup, including model architectures, user study evaluations and model performance evaluations.

Details on model architecture. In §6.1, we already describe the LD-CC model for the training from scratch scenario. Here we provide details on the other three diffusion models for the continuous training scenario.

• Stable Diffusion V2 (SD-V2): We simulate the popular training scenario where the model trainer updates the pretrained Stable Diffusion V2 model (SD-V2) [76] using new training data [21]. SD-V2 is trained on a subset of the LAION-aesthetic dataset [64]. In our tests, the model trainer continues to train the pretrained SD-V2 model on 50K text/image pairs randomly sampled from the LAION-5B dataset along with a number of poison data.
• Stable Diffusion XL (SD-XL): Stable Diffusion XL (SD-XL) is the newest and the state-of-the-art diffusion model, outperforming SD-V2 in various benchmarks [49]. The SD-XL model has over 2.6B parameters compared to the 865M parameters of SD-V2. SD-XL is trained on an internal dataset curated by Stability AI. In our test, we assume a similar training scenario where the model trainer updates the pretrained SD-XL model on a randomly selected subset (50K) of the LAION-5B dataset and a number of poison data.
• DeepFloyd (DF): DeepFloyd [77] (DF) is another popular diffusion model that has a different model architecture from LD, SD-V2, and SD-XL. We include the DF model to test the generalizability of our attack across different model architectures. Like the above, the model trainer updates the pretrained DF model using a randomly selected subset (50K) of the LAION-5B dataset and a number of poison data.

Details on user study. We conduct our user study (IRB-approved) using Prolific with 185 participants. We select only English speaking participants who have a task approval rate > 99% and have completed at least 100 surveys prior to our study. We compensate each participant at a rate of $15/hr.

Details on evaluating a model’s CLIP alignment score and FID. We follow prior work [56, 57] to query the poisoned model with 20K MSCOCO text prompts (covering
a variety of objects and styles) and generate 20K images. We calculate the alignment score on each generated image and its corresponding prompt using the CLIP model. We calculate FID by comparing the generated images with clean images in the MSCOCO dataset using an image feature extractor model [32].

A.2 PCA Visualization of Concept Sparsity

We also visualize semantic frequency of text embeddings in a 2D space. Figure 18 provides a feature space visualization of the semantic frequency for all the common concepts (nouns), compressed via PCA. Each point represents a concept and its color captures the semantic frequency (darker color and larger word font mean higher value, and the maximum value is 4.17%). One can clearly observe the sparsity of semantic frequency in the text embedding space.

Figure 18. 2D PCA visualization of semantic frequency in LAION-Aesthetic. Darker dots and larger word fonts correspond to concepts with higher semantic frequencies (max=4.17%). We randomly pick concepts to show their word content.

A.3 Additional Results of Simple Dirty-Label Poisoning Attacks

Attacking LD-CC. Figure 19 illustrates the attack success rate of the simple, dirty-label poisoning attack (§4), evaluated by both a CLIP-based classifier and human inspectors. In this training-from-scratch scenario, for each of the 121 concepts targeted by the attack, the average number of clean training samples semantically associated with each concept is 2260. Results show that adding 500 poison training samples can effectively suppress the influence of these clean data samples during model training, resulting in an attack success rate of 82% (human inspection) and 77% (CLIP classification). Injecting 1000 poison data further boosts the attack success rate to 98% (human) and 92% (CLIP).

Attacking SD-V2, SD-XL, DeepFloyd. Figure 20 shows the poisoning result in the continuous training scenario assessed by the CLIP classifier and Figure 21 shows the result evaluated via human inspection. Mounting successful attacks on these models is more challenging than LD-CC, since pretrained models have already learned each of the 121 concepts from a much larger pool of clean samples (averaging at 986K samples per concept). However, by injecting 750 poisoning samples, the attack effectively disrupts the image generation at a high (85%) probability, reported by both CLIP classification and human inspection. Injecting 1000 poisoning samples pushes the success rate beyond 90%.

Figure 22 compares the CLIP attack success rate between object and style concepts. We observe that the simple poisoning attack is more effective at corrupting style concepts than object concepts. This is likely because styles are typically conveyed visually by the entire image, while objects define specific regions within the image.

Concept Sparsity Affecting Attack Efficacy. Figure 23 demonstrates how concept sparsity in terms of word frequency impacts attack efficacy and we further study the impact of semantic frequency in Figure 24. For this we sample 15 object concepts with varying sparsity levels, in terms of word and semantic frequency discussed in §3.3. As expected, the poisoning attack is more successful when disrupting more sparse concepts. Moreover, semantic frequency is a more accurate representation of concept sparsity than word frequency, because we see higher correlation between semantic frequency and attack efficacy. These empirical results confirm our hypothesis in §3.2.

Task      CLIP attack success rate on artist names
          100 poison    200 poison    300 poison
LD-CC     80%           91%           96%
SD-V2     81%           94%           97%
SD-XL     77%           92%           99%
DF        80%           96%           99%

Table 7. Poison attack damages related concepts (artist names) when the attacker poisons given art styles across 4 generation models.

L2 distance to        Average number of     Average CLIP attack success rate
source concept (D)    concepts included     100 poison    200 poison    300 poison
D = 0                 1                     84%           94%           96%
0 < D ≤ 3.0           5                     81%           93%           96%
3.0 < D ≤ 6.0         13                    78%           90%           92%
6.0 < D ≤ 9.0         52                    32%           41%           59%
D > 9.0               1929                  5%            5%            6%

Table 8. Bleed through performance of the enhanced poison. (SD-XL)

A.4 Additional Results on Bleed through and Stacking Multiple Attacks

We evaluate the “related” concept bleed-through effects between artists and the art styles they are known for. We include 195 artists associated with 28 styles from the Wikiart dataset [60]. We poison each art style C, then test the poison’s impact on generating paintings of artists whose style belongs
to style C, without mentioning the poisoned style C in the prompt, e.g., query with “a painting by Picasso” for models with “cubism” poisoned. Table 7 shows that with 200 poison data on art style, Nightshade achieves at least 91% CLIP attack success rate on artist names alone, similar to its performance on the poisoned art style.

Enhancing bleed-through. We can further enhance our poison attack’s bleed-through by broadening the sampling pool of poison text prompts: sampling text prompts in the text semantic space of C rather than with exact word match to C. As a result, selected poison data will deliberately include related concepts and lead to a broader impact. Specifically, when we calculate activation similar to the poisoned concept C, we use all prompts in the LAION-5B dataset (they do not need to include C). Then we select the top 5K prompts with the highest activation, which results in poison prompts containing both C and nearby concepts. We keep the rest of our poison generation algorithm identical. This enhanced attack increases bleed-through by 11% in some cases while having minimal performance degradation (< 1%) on the poisoned concept (Table 8).

Stacking multiple poisons. Table 9 lists, for the LD-CC model, the overall model performance in terms of the CLIP alignment score and FID, when an increased number of concepts are being poisoned.

                                      # of poisoned    Overall model performance
Approach                              concepts         Alignment Score     FID
                                                       (higher better)     (lower better)
Clean LD-CC                           0                0.31                17.2
Poisoned LD-CC                        100              0.29                22.5
Poisoned LD-CC                        250              0.27                29.3
Poisoned LD-CC                        500              0.24                36.1
Poisoned LD-CC                        1000             0.22                44.2
AttnGAN                               -                0.26                35.5
A model that outputs random noise     -                0.20                49.4

Table 9. Overall model performance (in terms of the CLIP alignment score and FID) when an increasing number of concepts are being poisoned. We also show baseline performance of a GAN model from 2017 and a model that outputs random Gaussian noise. (LD-CC)
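The two metrics reported in Table 9 and described in A.1 can be sketched from precomputed embeddings. This is a minimal illustration, not the evaluation code used in the paper: it assumes the alignment score is the mean cosine similarity between paired CLIP image and text embeddings, and that FID is the Fréchet distance between Gaussians fitted to image features [32]. The function names and the NumPy-only eigenvalue shortcut for the matrix square root are choices made here; feature extraction itself is out of scope, and features are assumed to have at least two dimensions.

```python
import numpy as np

def alignment_score(image_emb, text_emb):
    """Mean cosine similarity between row-paired image and text embeddings."""
    img = np.asarray(image_emb, dtype=float)
    txt = np.asarray(text_emb, dtype=float)
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return float(np.mean(np.sum(img * txt, axis=1)))

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2))
    """
    a = np.asarray(feats_a, dtype=float)
    b = np.asarray(feats_b, dtype=float)
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)

    # The eigenvalues of C_a @ C_b are real and non-negative for PSD inputs,
    # so the trace of its square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Under these definitions, a model whose outputs drift away from their prompts lowers the alignment score, and a model whose outputs drift away from the clean image distribution raises the Fréchet distance, which matches the direction of the trends in Table 9.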

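The prompt-selection step described under “Enhancing bleed-through” in A.4 (rank all candidate prompts by their activation with respect to the poisoned concept C and keep the top 5K) can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity between text embeddings stands in for the paper's activation measure, and `select_top_k_prompts`, `prompt_embs` and `concept_emb` are hypothetical names for precomputed inputs.

```python
import numpy as np

def select_top_k_prompts(prompt_embs, concept_emb, k=5000):
    """Return indices and scores of the k prompts closest to the concept.

    Cosine similarity is used as a stand-in for the activation measure
    (an assumption; the paper does not spell out the computation here).
    """
    prompts = np.asarray(prompt_embs, dtype=float)
    concept = np.asarray(concept_emb, dtype=float)

    prompts = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    concept = concept / np.linalg.norm(concept)

    sims = prompts @ concept                    # one score per prompt
    k = min(k, sims.size)
    top = np.argsort(-sims, kind="stable")[:k]  # highest similarity first
    return top, sims[top]
```

Because candidates are ranked in embedding space rather than filtered by exact word match, prompts that never mention C but sit close to it can be selected, which is the mechanism the paragraph credits for the broader bleed-through.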
[Plot: attack success (0 to 1) vs. number of poison data injected (0 to 1000); curves: CLIP, Human-rated.]
Figure 19. Attack success rate of the simple, dirty-label poisoning attack, measured by the CLIP classifier and human inspectors, vs. # of poison data injected, when attacking LD-CC (training from scratch).

[Plot: CLIP attack success (0 to 1) vs. number of poison data injected (0 to 1000); curves: SD-V2, SD-XL, DF.]
Figure 20. Attack success rate of the simple, dirty-label poisoning attack, measured by the CLIP classifier, vs. # of poison data injected, when attacking each of three models SD-V2, SD-XL, DeepFloyd (continuous training).

[Plot: human-rated attack success (0 to 1) vs. number of poison data injected (0 to 1000); curves: SD-V2, SD-XL, DF.]
Figure 21. Attack success rate of the simple, dirty-label poisoning attack, measured by human inspectors, vs. # of poison data injected, when attacking each of three models SD-V2, SD-XL, DeepFloyd (continuous training).

[Plot: CLIP attack success rate (0 to 1) vs. number of poison data injected (0 to 1000); curves: Object Concepts, Style Concepts.]
Figure 22. Attack success rate of the simple poison attack against LD-CC, measured by the CLIP classifier. The simple poisoning attack is more effective at corrupting style concepts than object concepts. The same applies to attacks against SD-V2, SD-XL, DeepFloyd.

[Plot: CLIP attack success (0.7 to 1) vs. word frequency (0% to 0.15%).]
Figure 23. Success rate of the simple poisoning attack (rated by CLIP classifier) is weakly correlated with concept sparsity measured by word frequency in the training data. Results for LD-CC. Same trend observed on SD-V2, SD-XL, DeepFloyd.

[Plot: CLIP attack success (0.7 to 1) vs. semantic frequency (0% to 0.4%).]
Figure 24. Success rate of the simple poisoning attack (rated by CLIP classifier) correlates strongly with concept sparsity measured by semantic frequency. Results for LD-CC. Same trend observed on SD-V2, SD-XL, DeepFloyd.
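The captions of Figures 23 and 24 describe the relationship between concept frequency and attack success as “weakly” and “strongly” correlated. One way to quantify such a statement is a correlation coefficient over per-concept measurements. The sketch below uses the Pearson coefficient, which is an assumption made here since the text does not name a specific statistic; `pearson_r` is a hypothetical helper and no values from the paper are reproduced.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()   # centre both variables
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

With one frequency value and one CLIP attack success rate per sampled concept, a coefficient near -1 would indicate that sparser concepts are consistently easier to poison, while a coefficient near 0 would indicate that the frequency measure explains little of the variation.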
