Fake Images Generated by Text-to-Image Generation Models
Abstract
Text-to-image generation models that generate images based [...]
1.1 Our Contributions

There are multiple approaches to alleviate the concerns brought by advanced generation models. In particular, one can build a detector to automatically detect whether an image is real or fake. Moreover, one can build an attribution model to attribute a synthetic image to its source generation model, such that the model owner can be held responsible for the model's misuse. So far, various efforts have been made in this field; however, they only focus on traditional generation models, represented by GANs [12, 33, 36]. To the best of our knowledge, no study has been done on text-to-image generation models. Also, whether the prompts used in such models can facilitate fake image detection and attribution remains unexplored.

In this work, we present the first study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we formulate the following three research questions (RQs).

• RQ1. Can we differentiate the fake images generated by various text-to-image generation models from the real ones, i.e., detection of fake and real images?

• RQ2. Can we attribute the fake images to their source text-to-image generation models, i.e., attribution of fake images to their sources?

• RQ3. What kinds of prompts are more likely to generate authentic images?

Methodology. To differentiate fake images from real ones (RQ1), i.e., fake image detection, we train a binary classifier/detector. To validate the generalizability of the detector, we deliberately train it on fake images generated by only one model and evaluate it on fake images generated by many other models. We consider two detection methods, i.e., image-only and hybrid, depending on the detector's knowledge. The image-only detector makes its decision solely based on the image itself. The hybrid detector considers both images and their corresponding prompts. Hybrid detection is a brand-new detection method designed specifically for detecting fake images created by text-to-image generation models. Concretely, we leverage the image and text encoders of the CLIP model [22] to transform an image and its prompt into two embeddings, which are then concatenated as the input to the detector. Note that in the prediction phase, an image's natural prompt may not be available. In such cases, we leverage an image captioning model, BLIP [16], to generate the prompt for the image.

To attribute a fake image to its source model (RQ2), we propose fake image attribution by training a multi-class classifier (instead of a binary classifier), which we name an attributor. Specifically, the attributor is trained on fake images generated by multiple text-to-image generation models. Fake images from the same model are labeled as the same class. Moreover, we establish the attributor by the same two methods used for the detector in RQ1, i.e., image-only and hybrid.

Different from RQ1 and RQ2, RQ3 focuses on the impact of prompts on the authenticity of generated images. To this end, we conduct prompt analysis from semantic and structural perspectives. For the former, we design two semantic extraction methods to analyze the impact of prompt topics on the authenticity of fake images. More specifically, the first one directly uses the ground truth topics provided in the dataset for each prompt, and the second one automatically clusters the various prompts into different groups and extracts topics from these groups. From the structural perspective, we conduct the study based on the length of prompts and the proportion of nouns in prompts, respectively. Figure 1 presents an overview of our methods to address the three research questions.

Evaluation. We perform experiments on two benchmark prompt-image datasets, MSCOCO [19] and Flickr30k [35], and four popular pre-trained text-to-image generation models: Stable Diffusion [27], Latent Diffusion [27], GLIDE [21], and DALL·E 2 [23].

In fake image detection, extensive experimental results show that image-only detectors achieve good performance in some cases, while hybrid detectors achieve better performance in all cases. For example, on MSCOCO [19], the image-only detector trained on fake images generated by Stable Diffusion achieves an accuracy of 0.834 in differentiating fake images generated by Latent Diffusion from the real ones, while it only achieves 0.613 and 0.554 on GLIDE and DALL·E 2, respectively. Encouragingly, the hybrid detector trained on fake images from Stable Diffusion achieves 0.932/0.899/0.885 accuracy on Latent Diffusion/GLIDE/DALL·E 2 with natural prompts, and 0.945/0.909/0.891 accuracy with BLIP-generated prompts. These results demonstrate that fake images from various models can indeed be distinguished from real images. We further extract a common feature from fake images generated by various models in Section 3.4, which implies the existence of a common artifact shared by fake images across various models.

In fake image attribution, our experiments show that both image-only and hybrid attributors achieve good performance in all cases. Similarly, the hybrid attributor is better than the image-only one. For instance, the image-only attributor achieves an accuracy of 0.815 in attributing fake images to the models we consider, while the hybrid attributor achieves 0.880 with natural prompts and 0.850 with BLIP-generated prompts. These results demonstrate that fake images can indeed be attributed to their corresponding text-to-image generation models. We further show the unique feature extracted from each model in Section 4.4, which implies that different models leave unique fingerprints in the fake images they generate.

In prompt analysis, we first find that prompts with the topics of “skis” and “snowboard” tend to generate more authentic images through our first semantic extraction method, which relies on the ground truth information from the dataset. However, by clustering the various prompts over embeddings produced by a sentence transformer [26], we find that prompts with the “person” topic can actually generate more authentic images.
Upon further inspection, we discover that most of the images associated with “skis” and “snowboard” are also related to “person.” These results indicate that prompts with the topic “person” are more likely to generate authentic fake images. From the structural perspective, our experiments show that prompts with a length between 25 and 75 enable text-to-image generation models to generate fake images with higher authenticity, while the proportion of nouns in the prompt has no significant impact.

Implications. In this paper, we make the first attempt to tackle the threat caused by fake images generated by text-to-image generation models. Our results on detecting fake images and attributing them to their source models are encouraging, which suggests that our solution has the potential to play an essential role in mitigating this threat. We will share our source code with the community to facilitate future research in this field.

2 Preliminaries

2.1 Text-to-Image Generation Models

During the past few months, text-to-image generation models have attracted an increasing amount of attention. A model in this domain normally takes a prompt, i.e., a piece of text, and random noise as the input, and then denoises the image under the guidance of the prompt so that the generated image matches the description. In this work, we focus on the following four popular text-to-image generation models that are publicly available online (a sketch of how such a model is queried follows the list).

• Stable Diffusion [27]. Stable Diffusion (SD) is a diffusion model for text-to-image generation. The available model we use3 is pre-trained on 512×512 images from a subset of the LAION-5B [30] dataset. The CLIP model's [22] text encoder is used to condition the model on prompts.

• Latent Diffusion [27]. Latent Diffusion (LD) is also a diffusion model for text-to-image generation. The available model we use4 is pre-trained on LAION-400M, a smaller dataset sampled from LAION-5B [30]. Latent Diffusion also leverages the text encoder of CLIP to guide the direction of fake images.

• GLIDE [21]. GLIDE is a text-to-image generation model proposed by OpenAI. The available model5 is trained on a filtered version of a dataset comprised of several hundred million prompt-image pairs. Notably, GLIDE is not good at understanding prompts that contain “person” topics, as such images have been removed from its training dataset due to ethical concerns.

• DALL·E 2 [23]. DALL·E 2 is one of the most popular text-to-image generation models proposed by OpenAI. A transformer [32] is used to capture the information from both images and prompts. In our experiments, we adopt a PyTorch implementation of DALL·E 2, DALL·E 2-Pytorch.6 DALL·E 2-Pytorch uses an extra layer of indirection with the prior network and is trained on a subset of LAION-5B [30]. Note that we refer to DALL·E 2-Pytorch as DALL·E 2 in the rest of our paper.

3 https://github.com/CompVis/stable-diffusion
4 https://github.com/CompVis/latent-diffusion
5 https://github.com/openai/glide-text2im
6 https://github.com/lucidrains/DALLE2-pytorch
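The following is a minimal sketch of how such a pre-trained model can be queried with a prompt to produce a fake image. It assumes the Hugging Face diffusers package and the public CompVis/stable-diffusion-v1-4 checkpoint; both are assumptions for illustration, not necessarily our exact generation pipeline.

import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (an illustrative choice; the
# experiments use the models referenced in the footnotes above).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "A man is in a kitchen making pizzas"  # an MSCOCO-style caption
image = pipe(prompt).images[0]                  # a PIL image, 512x512 by default
image.save("fake.png")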
2.2 Datasets

We use the following two benchmark prompt-image datasets to conduct our experiments.

• MSCOCO [19]. MSCOCO is a large-scale object detection, segmentation, and captioning dataset. It is a standard benchmark for evaluating the performance of computer vision models. MSCOCO contains 328,000 images distributed over 80 classes of natural objects, and each image has several corresponding captions, i.e., prompts. In this work, we consider the first 60,000 prompts due to the constraints of our lab's computational resources.

• Flickr30k [35]. Flickr30k is a widely used dataset for research on image captioning, language understanding, and multimodal learning. It contains 31,783 images and 158,915 English prompts covering various scenarios. All images are collected from the Flickr website, and the prompts are written by Flickr users in natural language.

Note that the prompts from these datasets are also important, as they will be used to generate fake images or serve as inputs for the hybrid classifiers (see Section 3 for more details).

In summary, the text-to-image generation models and datasets we consider in this work are listed in Table 1. Since different models are trained on images of different sizes, and fake images appear in the real world at different resolutions, we adopt the default settings of these available models and perform experiments on fake images of different sizes.

3 Fake Image Detection

In this section, we present our fake image detector to differentiate fake images from real ones (RQ1). We start by introducing our design goals. Then, we present how to construct the detector. Finally, we show the experimental results.

3.1 Design Goals

To tackle the threats posed by the misuse of various text-to-image generation models, the design of our detector should follow these points.

• Differentiating Between Fake and Real Images. The primary goal of the detector is to effectively differentiate fake images generated by text-to-image generation models from real ones. Successful detection of fake images can reduce the threat posed by the misuse of these advanced models.
• Agnostic to Models and Datasets. As text-to-image generation models have undergone rapid development, it is likely that more advanced models will be proposed in the future. As a result, it is difficult for us to collect all text-to-image generation models to build our detector. Moreover, building the detector on various models [...]

Table 1: The text-to-image generation models, datasets, and the number/size of fake images we consider in this work. Note that the number of fake images from DALL·E 2 is low due to its poor image generation efficiency.

Model      Dataset    # Fake Images  Size
LD         MSCOCO     31,276         256×256
LD         Flickr30k  17,969         256×256
GLIDE      MSCOCO     41,685         256×256
GLIDE      Flickr30k  27,210         256×256
DALL·E 2   MSCOCO     1,028          256×256
DALL·E 2   Flickr30k  300            256×256

Figure 2: An illustration of fake image detection. The red part describes image-only detection. The green part describes hybrid detection. The blue part describes fake images generated by other text-to-image generation models.
[...]

Real images contain rich information that prompts cannot fully and faithfully describe. However, since fake images are generated based on prompts, they may not contain such information and thus are not as informative as real images. Therefore, introducing prompts together with images enlarges the disparity between fake and real images, which in our opinion can contribute to differentiating between the two. We further show in Section 3.4 that the disparity between real and fake images is indeed huge from the prompt's perspective. Note that using prompts as an extra signal for fake image detection is novel and unique to text-to-image generation models, as prompts do not participate in the image generation process of traditional generation models, like GANs.

Figure 3: The performance of the forensic classifier and detectors. We conduct the evaluation on (a) MSCOCO and (b) Flickr30k, respectively.
The green part of Figure 2 shows the pipeline of our hybrid detection. Specifically, the process of training our hybrid detector can also be divided into three stages, i.e., data collection, dataset construction, and detector construction (a sketch of the resulting detector follows the list).

• Data Collection. To collect the real and fake images, we follow the same step as the first step for the image-only detector.

• Dataset Construction. Since our hybrid detector takes images and prompts as input, we label all fake images and their corresponding prompts as 0 and label real images and their corresponding prompts as 1. Similarly, we then create a training dataset containing a total of 40,000 prompt-image pairs.

• Detector Construction. To exploit the prompt information, we take advantage of CLIP's image encoder and text encoder as feature extractors to obtain high-level embeddings of images and prompts. Then, we concatenate the image embeddings and text embeddings together as new embeddings and use them to train a binary classifier, i.e., a 2-layer multilayer perceptron, as our detector.
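The following is a minimal sketch of this construction, assuming OpenAI's CLIP package with the ViT-B/32 backbone; the hidden width and other hyperparameters are illustrative, not our exact settings.

import torch
import torch.nn as nn
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# 2-layer MLP over concatenated CLIP embeddings; ViT-B/32 yields
# 512-dimensional image and text embeddings.
detector = nn.Sequential(
    nn.Linear(512 + 512, 256),  # hidden width is illustrative
    nn.ReLU(),
    nn.Linear(256, 2),          # class 0: fake, class 1: real
).to(device)

def embed(image, prompt):
    # Concatenate the CLIP image and text embeddings of a prompt-image pair.
    with torch.no_grad():
        img = clip_model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = clip_model.encode_text(clip.tokenize([prompt]).to(device))
    return torch.cat([img, txt], dim=-1).float()

pair = embed(Image.open("query.png").convert("RGB"),
             "A man is in a kitchen making pizzas")
logits = detector(pair)  # trained with a standard cross-entropy loss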
To evaluate the trained hybrid detector, we need both images and their corresponding prompts. Typically, a user may attach a description to an image they post on the Internet. Therefore, we can directly consider this attached description as the prompt for the image. In our experiments, we adopt the original/natural prompts from the dataset to conduct the evaluation.

In a more realistic and challenging scenario where the detector cannot obtain the natural prompts, we propose a simple yet effective method to generate the prompts ourselves. Concretely, we leverage the BLIP [16] model (an image captioning model) to generate captions for the queried images and then regard these generated captions as the prompts for the images, as sketched below.
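A minimal sketch of this fallback, assuming the BLIP port in Hugging Face transformers with the Salesforce/blip-image-captioning-base checkpoint (the specific checkpoint is our assumption; the paper only specifies BLIP [16]):

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

image = Image.open("query.png").convert("RGB")      # the queried image
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
prompt = processor.decode(caption_ids[0], skip_special_tokens=True)
# `prompt` now serves as the generated prompt for the hybrid detector.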
3.3 Results

We now present the performance of our proposed image-only detection and hybrid detection for fake image detection.

Image-Only Detection. For a convincing evaluation, we adopt the existing work [33] on detecting fake images generated by various types of generation models, including GANs and low-level vision models [4, 7], as a baseline. The authors of [33] name their classifier the forensic classifier. Note that this forensic classifier is the state-of-the-art fake image detector for generation models, and the authors show that it has strong generalizability. For instance, it can achieve an accuracy of 0.946 in differentiating fake images generated by StarGAN [6], which is not considered during model training, from real images.

Figure 3 depicts the evaluation results. First of all, we can observe that the forensic classifier cannot effectively distinguish fake images (generated by text-to-image generation models) from real ones. In all cases, the forensic classifier only achieves an accuracy of 0.5, which is equivalent to random guessing. Based on this observation, we can conclude that the forensic classifier does not generalize to text-to-image generation models. We attribute this to the differences between traditional generation models and text-to-image generation models. This result also highlights the urgent need for counterpart solutions against the misuse of text-to-image generation models.

Furthermore, we can observe that the image-only detector performs much better than the forensic classifier in all cases. For example, the image-only detector can achieve an accuracy of 0.871 in distinguishing fake images generated by LD+Flickr30k (querying the prompts of Flickr30k to LD) from real images. We emphasize here that the image-only detector is trained only on fake images generated by SD+MSCOCO and has never seen fake images generated by other models given prompts from other datasets. We conjecture that this is due to some common properties shared by all fake images generated by text-to-image generation models (see Section 3.4 for more details).

Lastly, another interesting finding is the much larger variation in detection performance due to the effect of the model compared to the effect of the dataset. For example, in Figure 3a, the image-only detector achieves an accuracy of 0.913 on SD but only 0.526 on DALL·E 2. In contrast, comparing Figure 3a and Figure 3b, the image-only detector achieves very close accuracy on different datasets over all text-to-image generation models. We attribute this observation to the unique fingerprint of fake images generated by text-to-image generation models (see Section 4.4).

Hybrid Detection. Although the image-only detector achieves better performance in all cases compared to the [...]
[...] Then, we calculate two cosine similarities: one between the prompt's embedding and its real image's embedding, and the other between the prompt's embedding and its fake image's embedding. Finally, the two cosine similarities are normalized into a probability distribution via a softmax function [22], as sketched below. A higher probability implies a stronger connection between the image and the prompt. Figure 5 [...]
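A minimal sketch of this measure, assuming OpenAI's CLIP package; the file names are placeholders for a real image, its fake counterpart, and their shared prompt.

import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompt = "A man is in a kitchen making pizzas"  # shared caption (assumed)
real = Image.open("real.png").convert("RGB")    # real image (assumed path)
fake = Image.open("fake.png").convert("RGB")    # its fake counterpart (assumed path)

with torch.no_grad():
    txt = model.encode_text(clip.tokenize([prompt]).to(device))
    imgs = torch.stack([preprocess(real), preprocess(fake)]).to(device)
    emb = model.encode_image(imgs)

sims = F.cosine_similarity(txt, emb)  # [sim(prompt, real), sim(prompt, fake)]
probs = sims.softmax(dim=0)           # normalized into a probability distribution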
Figure 7: The performance of detectors in terms of the training dataset size on SD+MSCOCO. We conduct the evaluation on (a) SD+MSCOCO, (b) LD+MSCOCO, (c) GLIDE+MSCOCO, and (d) DALL·E 2+MSCOCO.
[...] Specifically, we propose two methods to construct the binary detector, namely image-only and hybrid. Our evaluation shows that the fake images from various models can indeed be differentiated from the real ones. Moreover, the hybrid detector obtains much better performance than the image-only detector, which demonstrates that introducing prompts together with images can indeed amplify the differences between fake and real images.

4 Fake Image Attribution

The previous section has shown that fake image detection, especially the hybrid detection we have proposed, can achieve remarkable performance. In this section, we explore whether fake images generated by various text-to-image generation models can be attributed to their source models, i.e., fake image attribution. We start by introducing our design goals. We then describe how to construct the fake image attributor. Finally, we present the evaluation results.

4.1 Design Goals

To attribute fake images to their source models, we follow two design goals.

• Tracking Sources of Fake Images. The primary goal of fake image attribution is to effectively attribute different fake images to their source generation models. The aim of attribution is to let a model owner be held responsible for the model's (possible) misuse. Previously, fake image attribution has been studied in the context of traditional generation models, like GANs [36].

• Agnostic to Datasets. In the real world, a fake image can be generated by a text-to-image generation model based on a prompt from any distribution. Therefore, to be more practical, the attribution should be independent of the prompt distribution.

4.2 Methodology

To attribute the fake images to their sources, we construct fake image attribution by training a multi-class classifier, referred to as an attributor, with each class corresponding to one model. As aforementioned, the attributor should be agnostic to datasets; thus, we establish the multi-class classifier based on prompts from only one dataset, e.g., MSCOCO, and test it on prompts from other datasets like Flickr30k.

Similar to fake image detection, we propose two different approaches to establish the attributor, namely image-only and hybrid. The image-only attributor accepts only images as input, and the hybrid attributor accepts both images and their corresponding prompts as input.

Image-Only Attribution. The process of establishing our image-only attributor can also be divided into three stages, namely data collection, dataset construction, and attributor construction (a sketch of the resulting attributor follows the list).

• Data Collection. We first randomly sample 20,000 images from MSCOCO as real images. Then, we use the prompts of these 20,000 images to query each model to get 20,000 fake images accordingly. Here, we adopt SD, LD, and GLIDE to generate fake images; in total, we obtain 60,000 fake images. The reason we do not consider DALL·E 2 is that we will use it for the experiments regarding adaptation to other models (see Section 4.4).

• Dataset Construction. We label all real images as 0 and all fake images from the same model as the same class. Concretely, we label the fake images from SD/LD/GLIDE as 1/2/3. We then create a training dataset containing a total of 80,000 images with four classes.

• Attributor Construction. We build the fake image attributor, i.e., a multi-class classifier, that accepts images as input and outputs the multi-class prediction, i.e., 0-real, 1-SD, 2-LD, or 3-GLIDE. We train the attributor from scratch using the created training dataset in conjunction with classical training techniques. Similar to the fake image detector, we leverage ResNet18 [14] as the attributor's architecture.
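A minimal training sketch for this attributor, assuming torchvision; the optimizer, learning rate, and the placeholder data loader are illustrative, as the paper specifies only the ResNet18 backbone and the four-class labeling.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

# Four-way classifier trained from scratch: 0-real, 1-SD, 2-LD, 3-GLIDE.
attributor = resnet18(num_classes=4)  # no pre-trained weights

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(attributor.parameters(), lr=1e-4)  # illustrative

# Placeholder loader; in practice it iterates over the 80,000-image
# training set described above.
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 256, 256), torch.randint(0, 4, (8,))),
    batch_size=4,
)

for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(attributor(images), labels)
    loss.backward()
    optimizer.step()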
After we have trained the attributor, we evaluate its performance in attributing images from various sources (i.e., real, SD, LD, and GLIDE) given prompts from the other dataset (i.e., Flickr30k). For testing, we sample the same number of images for all four classes, i.e., 10,000 each and 40,000 in total.

Hybrid Attribution. The previous evaluation in fake image detection has demonstrated the superior performance of the hybrid detector, verifying that introducing prompts together with images can amplify the differences between fake and real images. We now investigate whether a similar enhancement can be observed in the case of hybrid attribution.
The hybrid attributor is quite similar to the image-only attributor above and also consists of three stages, i.e., data collection, dataset construction, and attributor construction.

• Data Collection. To collect the images from various sources, we follow the same step as the first step for the image-only attributor.

• Dataset Construction. Since our hybrid attributor takes images and prompts as input, we label all real images with their corresponding prompts as 0 and all fake images from the same model with their corresponding prompts as the same class. Similarly, we then create a training dataset containing a total of 80,000 prompt-image pairs with four classes.

• Attributor Construction. To exploit the prompt information, we again use CLIP's image encoder and text encoder as feature extractors to obtain high-level embeddings of images and prompts. Then, we concatenate the image embeddings and text embeddings together as new embeddings and use them to train a multi-class classifier, which is also a 2-layer multilayer perceptron, as our attributor.

In order to evaluate the trained hybrid attributor, we need images and their corresponding prompts. We again consider two scenarios here: one in which we can directly obtain prompts for the images from the dataset, and the other in which we can only generate prompts for the images relying on BLIP.

4.3 Results

In this section, we present the performance of our two proposed types of fake image attribution.

Image-Only Attribution. We report the performance of image-only attribution in Table 2. Note that random guessing for the 4-class classification task only achieves 0.25. We find that our proposed image-only attributor achieves remarkable performance. For instance, it achieves an accuracy of 0.864 on images from various sources given prompts sampled from MSCOCO. These results indicate that fake images can be effectively attributed to their corresponding text-to-image generation models. We further show the unique feature extracted from each model in Section 4.4, which implies that different models may leave unique fingerprints in the fake images they generate.

Table 2: The performance of image-only attributors and hybrid attributors.

                             MSCOCO   Flickr30k
Image-Only                   0.864    0.863
Hybrid (natural prompts)     0.936    0.933
Hybrid (generated prompts)   0.903    0.892

Further, the image-only attributor also achieves a high accuracy of 0.863 on images from various source models given prompts sampled from the other dataset, Flickr30k. Note that we construct the attributor based on MSCOCO only. This result indicates that our proposed image-only attribution is agnostic to datasets.

Hybrid Attribution. Table 2 also depicts the performance of our proposed hybrid attribution. We can clearly see that hybrid attribution, whether with natural or generated prompts, achieves better performance than image-only attribution regardless of the dataset. These results demonstrate once again that fake images can be successfully attributed to their corresponding text-to-image generation models. They also verify that using prompts as an extra signal can improve attribution performance.

4.4 Discussion

The above evaluation demonstrates the effectiveness of our fake image attribution. We conjecture that each text-to-image generation model leaves a unique fingerprint in the fake images it generates. Next, we verify this conjecture by visualizing the fingerprints of different models. Besides, in the previous evaluation, the training and testing images for our attributor are disjoint but generated by the same set of text-to-image generation models. We further explore how to adapt our attributor to other models that are not considered during training.

Fingerprint Visualization. Similar to visualizing the shared artifact across fake images (see Section 3.4), we draw the frequency spectra of different text-to-image generation models built on MSCOCO. For each text-to-image generation model, we randomly select 2,000 fake images and then calculate the average of their Fourier transform outputs, as sketched below.
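A minimal sketch of this averaging, assuming NumPy and grayscale image arrays; the exact pre-processing (e.g., normalization or log scaling) is an assumption for visualization purposes.

import numpy as np

def mean_spectrum(images):
    # images: list of HxW grayscale arrays produced by one generation model.
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img in images:
        spec = np.abs(np.fft.fftshift(np.fft.fft2(img)))  # centered magnitude spectrum
        acc += np.log(spec + 1e-8)                        # log scale for visualization
    return acc / len(images)                              # the model's average fingerprint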
As shown in Figure 8, we can clearly observe distinct patterns in images generated by different text-to-image generation models, especially in GLIDE and DALL·E 2. We can also find that the frequency spectrum of SD is similar to that of LD, which can explain why the image-only detector built on SD also achieves very strong performance on LD (see Figure 3). The reason is that SD and LD follow similar algorithms, although they are trained on different datasets with different model architectures. In conclusion, this qualitative evaluation verifies that each text-to-image generation model has its unique fingerprint.

Adaptation to Unseen Models. In the previous experiments, we evaluate attribution on fake images generated by models considered during training. However, there are instances when we encounter fake images that are not from models involved in training, i.e., unseen models. Next, we explore how to adapt our attributor to unseen models.

To this end, we propose a simple yet effective approach named confidence-based attribution, sketched below. The key idea is to attribute the unconfident samples from the attributor's prediction, i.e., those with confidence lower than a pre-defined threshold, to unseen models. Here, all unseen models are considered as one class.8 [...]
8 In the current version, our approach cannot differentiate fake images from different unseen models.

Figure 8: The visualization of frequency analysis on fake images generated by (a) SD, (b) LD, (c) GLIDE, and (d) DALL·E 2.

Figure 9: The performance of attributors on the unseen model DALL·E 2 in terms of different thresholds. We conduct the evaluation on (a) MSCOCO and (b) Flickr30k.

Figure 10: The performance of attributors in terms of the training dataset size on MSCOCO. We conduct the evaluation on (a) MSCOCO and (b) Flickr30k.
Table 3: Top five prompts that generate the most real or fake images, as determined by the image-only detector. Gray cells in Real mean the prompt mainly describes the details of the subject. Gray cells in Fake mean the prompt mainly describes the environment where the subject is located.
[Figure: (a) Skis, (b) Snowboard]
[...] group the embeddings with DBSCAN [11], an advanced clustering method, as sketched below. The advantage of this second approach is that it can implicitly reflect the in-depth semantics of the prompts, which is also a common practice in the natural language processing literature. However, its disadvantage is that the concrete topic of each cluster needs to be summarized manually. By manual inspection, the cluster with the highest real-image proportion is related to the topic “person.” Ostensibly, this differs from the results of the first approach (“skis” and “snowboard” ranked the highest), which is based on the topics provided by MSCOCO. However, by manually checking fake images generated from prompts with the topics “skis” and “snowboard,” we discover that most of them depict “person” as well. We show some examples in Figure 12. This indicates that prompts related to “person” are likely to generate authentic fake images.
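A minimal sketch of this clustering step, assuming the sentence-transformers and scikit-learn packages; the checkpoint and the DBSCAN parameters are our assumptions, not the paper's settings.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

prompts = ["A man is in a kitchen making pizzas",
           "Two skiers pose on a snowy slope"]  # MSCOCO-style captions

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)
labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit_predict(embeddings)
# Each non-negative label is a cluster whose topic is summarized manually;
# -1 marks noise prompts that belong to no cluster.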
We further extract the top five prompts that generate the most real and fake images, respectively, according to the image-only detector, and list them in Table 3. We find that detailed descriptions of the subjects contribute to the generation of authentic images. For example, of the top five real prompts, four provide a detailed description of the subject, while four of the top five fake prompts describe the environment where the subject is located rather than the subject itself. In the future, we plan to investigate in depth the relationship between the prompts' semantics and the generated images' authenticity.

5.2 Structure Analysis

After the semantic analysis, we now conduct the structure analysis. Specifically, we study prompt structure from two angles, i.e., the length and the proportion of nouns. The length of the prompt reflects the prompt's complexity. The proportion of nouns is related to the number of objects appearing in the fake image. Here, we use the Natural Language Toolkit (NLTK) [2] to compute the proportion of nouns in a prompt, as sketched below.
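A minimal sketch of this computation with NLTK; counting all NN* part-of-speech tags as nouns is our assumption, as the paper does not specify the tag set.

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def noun_proportion(prompt: str) -> float:
    tokens = nltk.word_tokenize(prompt)
    tags = nltk.pos_tag(tokens)
    nouns = [w for w, t in tags if t.startswith("NN")]
    return len(nouns) / max(len(tokens), 1)

noun_proportion("A man is in a kitchen making pizzas")  # e.g., ~0.375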
In our experiments, we randomly select 5,000 prompts from MSCOCO and then feed these prompts to SD to generate fake images. Results are shown in Figure 13. We can see from Figure 13a that both extremely long and extremely short prompts cannot generate authentic images. In addition, almost all high-authenticity images are generated by prompts with lengths between 25 and 75. On the other hand, Figure 13b shows that the proportion of nouns in prompts does not have a significant impact on fake images' authenticity.

Figure 13: The relationship between the length/proportion of nouns in a prompt and the corresponding image's authenticity.

5.3 Takeaways

In summary, we conduct semantic analysis and structure analysis to study which types of prompts are more likely to drive text-to-image generation models to generate fake images with high authenticity. Empirical results demonstrate that a prompt with the topic “person” or a length between 25 and 75 is more likely to produce authentic images, thus making detection by our designed detectors more difficult.

6 Related Work

6.1 Text-to-Image Generation

Typically, text-to-image generation takes a text description (i.e., a prompt) as input and outputs an image that matches the text description. Some pioneering works of text-to-image generation [25, 38] are based on GANs [13]. By combining a prompt embedding and a latent vector, the authors expect the GANs to generate an image depicting the prompt. These works have stimulated more researchers [3, 15, 18, 31, 34, 37] to study text-to-image generation models based on GANs, but using GANs does not always achieve good generation performance [24, 27].

Recently, text-to-image generation has made great progress with the emergence of diffusion models [1, 21, 27, 28]. Models in this domain normally take random noise and prompts as input and reduce noisy images to clear ones under the guidance of the prompts. Currently, text-to-image generation based on diffusion models, such as DALL·E [24], Stable Diffusion [27], Imagen [28], GLIDE [21], and DALL·E 2 [23], has achieved state-of-the-art performance compared to previous works. This is also the reason why we focus on such models in this work.

6.2 Fake Image Detection and Attribution

Wang et al. [33] find that a simple CNN model can easily detect fake images generated by various types of traditional generation models (e.g., GANs [13] and low-level vision models [4, 7]) from real images. The authors argue that these fake images have some common defects that allow us to distinguish them from real images. Yu et al. [36] demonstrate that fake images generated by various traditional generation models can be attributed to their sources and reveal that these traditional generation models leave fingerprints in the generated images. Girish et al. [12] further propose a new attribution method to deal with the open-world scenario where the detector has no knowledge of the generation model.

We emphasize here that almost all existing works focus only on traditional generation models, such as GANs [13], low-level vision models [4, 7], and perceptual loss generation models [5, 17]. Detecting and attributing fake images generated by text-to-image generation models remain largely unexplored. In this work, we take the first step to systematically study this problem.
7 Conclusion

In this paper, we delve into three research questions concerning the detection and attribution of fake images generated by text-to-image generation models. To solve the first research question of whether we can distinguish fake images from real ones, we propose fake image detection. Our fake image detection consists of two types of detectors: an image-only detector and a hybrid detector. The image-only detector utilizes only images as input to identify fake images, while the hybrid detector leverages both the image and the corresponding prompt. In the testing phase, if the hybrid detector cannot obtain the natural prompt of an image, we take advantage of BLIP, an image captioning model, to generate a prompt for the image. Our extensive experiments show that while an image-only detector can achieve strong performance on certain text-to-image generation models, a hybrid detector always performs better. These results demonstrate that fake images generated by different text-to-image generation models share common features. Also, prompts can serve as an extra “anchor” to help the detector better differentiate between fake and real images.

To tackle the second research question, we conduct fake image attribution to attribute fake images from different text-to-image generation models to their source models. Similarly, we develop two types of multi-class classifiers: an image-only attributor and a hybrid attributor. Empirical results show that both the image-only attributor and the hybrid attributor perform well in all cases. This implies that fake images generated by different text-to-image generation models exhibit different properties, which can be viewed as fingerprints.

Finally, we address the third research question, i.e., which kinds of prompts are more likely to generate authentic images. We study the properties of prompts from semantic and structural perspectives. From the semantic perspective, we show that prompts with the topic “person” yield more authentic fake images than prompts with other topics. From the structural perspective, our experiments reveal that prompts with lengths ranging from 25 to 75 allow text-to-image generation models to create more authentic fake images.

Overall, this work presents the first comprehensive study of detecting and attributing fake images generated by state-of-the-art text-to-image generation models. As our empirical results are encouraging, we believe our detectors and attributors can play an essential role in mitigating the threats caused by fake images created by these advanced generation models. We will share our code to facilitate future research in this field.

References

[1] James Atwood and Don Towsley. Diffusion-Convolutional Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS), pages 1993–2001. NIPS, 2016.
[2] Steven Bird and Edward Loper. NLTK: The Natural Language Toolkit. In Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 2004.
[3] Navaneeth Bodla, Gang Hua, and Rama Chellappa. Semi-supervised FusedGAN for Conditional Image Generation. In European Conference on Computer Vision (ECCV), pages 689–704. Springer, 2018.
[4] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to See in the Dark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3291–3300. IEEE, 2018.
[5] Qifeng Chen and Vladlen Koltun. Photographic Image Synthesis with Cascaded Refinement Networks. In IEEE International Conference on Computer Vision (ICCV), pages 1520–1529. IEEE, 2017.
[6] Yunjey Choi, Min-Je Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8789–8797. IEEE, 2018.
[7] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-Order Attention Network for Single Image Super-Resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11065–11074. IEEE, 2019.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186. ACL, 2019.
[9] Pierre L. Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff, Richard A. Young, and Brian Belgodere. Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge. Journal of Artificial Intelligence Research, 2022.
[10] Elisa Kreiss, Noah D. Goodman, and Christopher Potts. Concadia: Tackling Image Accessibility with Descriptive Texts and Context. CoRR abs/2104.08376, 2021.
[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231. AAAI, 1996.
[12] Sharath Girish, Saksham Suri, Sai Saketh Rambhatla, and Abhinav Shrivastava. Towards Discovery and Attribution of Open-World GAN Generated Images. In IEEE International Conference on Computer Vision (ICCV), pages 14094–14103. IEEE, 2021.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Annual Conference on Neural Information Processing Systems (NIPS), pages 2672–2680. NIPS, 2014.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
[15] Qicheng Lao, Mohammad Havaei, Ahmad Pesaranghader, Francis Dutil, Lisa Di-Jorio, and Thomas Fevens. Dual Adversarial Inference for Text-to-Image Synthesis. In IEEE International Conference on Computer Vision (ICCV), pages 7566–7575. IEEE, 2019.
[16] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. CoRR abs/2201.12086, 2022.
[17] Ke Li, Tianhao Zhang, and Jitendra Malik. Diverse Image Synthesis From Semantic Layouts via Conditional IMLE. In IEEE International Conference on Computer Vision (ICCV), pages 4219–4228. IEEE, 2019.
[18] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-Driven Text-To-Image Synthesis via Adversarial Training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12174–12182. IEEE, 2019.
[19] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
[20] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, 2013.
[21] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. CoRR abs/2112.10741, 2021.
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
[23] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. CoRR abs/2204.06125, 2022.
[24] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. In International Conference on Machine Learning (ICML), pages 8821–8831. JMLR, 2021.
[25] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning (ICML), pages 1060–1069. JMLR, 2016.
[26] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980–3990. ACL, 2019.
[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695. IEEE, 2022.
[28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. CoRR abs/2205.11487, 2022.
[29] Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning. CoRR abs/2207.07635, 2022.
[30] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. CoRR abs/2111.02114, 2021.
[31] Douglas M. Souza, Jonatas Wehrmann, and Duncan D. Ruiz. Efficient Neural Architecture for Text-to-Image Synthesis. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Annual Conference on Neural Information Processing Systems (NIPS), pages 5998–6008. NIPS, 2017.
[33] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-Generated Images Are Surprisingly Easy to Spot... for Now. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8692–8701. IEEE, 2020.
[34] Zixu Wang, Zhe Quan, Zhi-Jie Wang, Xinjian Hu, and Yangyang Chen. Text to Image Synthesis With Bidirectional Generative Adversarial Network. In International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
[35] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014.
[36] Ning Yu, Larry Davis, and Mario Fritz. Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints. In IEEE International Conference on Computer Vision (ICCV), pages 7555–7565. IEEE, 2019.
[37] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 833–842. IEEE, 2021.
[38] Han Zhang, Tao Xu, and Hongsheng Li. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV), pages 5908–5916. IEEE, 2017.
[39] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and Simulating Artifacts in GAN Fake Images. In IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2019.