DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models

Zeyang Sha1 Zheng Li1 Ning Yu2 Yang Zhang1

1 CISPA Helmholtz Center for Information Security 2 Salesforce Research

arXiv:2210.06998v2 [cs.CR] 9 Jan 2023

Abstract

Text-to-image generation models that generate images based on prompt descriptions have attracted an increasing amount of attention during the past few months. Despite their encouraging performance, these models raise concerns about the misuse of their generated fake images. To tackle this problem, we pioneer a systematic study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we first build a machine learning classifier to detect the fake images generated by various text-to-image generation models. We then attribute these fake images to their source models, such that model owners can be held responsible for their models' misuse. We further investigate how prompts that generate fake images affect detection and attribution. We conduct extensive experiments on four popular text-to-image generation models, including DALL·E 2, Stable Diffusion, GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical results show that (1) fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models; (2) fake images can be effectively attributed to their source models, as different models leave unique fingerprints in their generated images; (3) prompts with the "person" topic or a length between 25 and 75 enable models to generate fake images with higher authenticity. All findings contribute to the community's insight into the threats caused by text-to-image generation models. We appeal to the community's consideration of counterpart solutions, like ours, against the rapidly evolving generation of fake images.

Figure 1: An illustration of our work, including fake image detection, fake image attribution, and prompt analysis.

1 Introduction

Text-to-image generation models have made tremendous progress during the past few months. State-of-the-art models in this field, like Stable Diffusion [27] and DALL·E 2 [24], are able to generate high-quality images ranging from artworks to photorealistic news illustrations. Traditional image generation models, such as generative adversarial networks (GANs) [13], generate synthetic/fake images with latent code sampled from a Gaussian distribution. Text-to-image generation models, on the other hand, require users to provide textual inputs, namely prompts, and generate images that match the prompts.

The high-quality synthetic images created by text-to-image generation models can be used for various purposes. For instance, they can facilitate the materialization of a novelist's envisioned scene, automate the generation of illustrations for advertising campaigns, and create physical scenes that cannot be captured photographically. However, these synthetic images also pose severe threats to society. For instance, such images can be used by malicious parties to disseminate misinformation. As reported by TechCrunch, Stable Diffusion is able to generate realistic images, e.g., images of the war in Ukraine, that may be used for propaganda.¹ Also, these images can jeopardize the art industry. The BBC reported that fake artworks generated by text-to-image generation models won first place in an art competition, which caused complaints from the artists involved.²

1 https://techcrunch.com/2022/08/12/a-startup-wants-to-democratize-the-tech-behind-dall-e-2-consequences-be-damned/.
2 https://www.bbc.com/news/technology-62788725.
1.1 Our Contributions

There are multiple approaches to alleviate the concerns brought by advanced generation models. In particular, one can build a detector to detect whether an image is real or fake automatically. Moreover, one can build an attribution model to attribute a synthetic image to its source generation model, such that the model owner can be held responsible for the model's misuse. So far, various efforts have been made in this field; however, they only focus on traditional generation models, represented by GANs [12, 33, 36]. To the best of our knowledge, no study has been done on text-to-image generation models. Also, whether the prompts used in such models can facilitate fake image detection and attribution remains unexplored.

In this work, we present the first study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we formulate the following three research questions (RQs).

• RQ1. Can we differentiate the fake images generated by various text-to-image generation models from the real ones, i.e., detection of fake and real images?

• RQ2. Can we attribute the fake images to their source text-to-image generation models, i.e., attribution of fake images to their sources?

• RQ3. What kinds of prompts are more likely to generate authentic images?

Methodology. To differentiate fake images from real ones (RQ1), i.e., fake image detection, we train a binary classifier/detector. To validate the generalizability of the detector, we deliberately train it on fake images generated by only one model and evaluate it on fake images generated by other models. We consider two detection methods, i.e., image-only and hybrid, depending on the detector's knowledge. The image-only detector makes its decision solely based on the image itself. The hybrid detector considers both images and their corresponding prompts. Hybrid detection is a brand-new detection method, designed specifically for detecting fake images created by text-to-image generation models. Concretely, we leverage the image and text encoders of the CLIP model [22] to transfer an image and its prompt to two embeddings, which are then concatenated as the input to the detector. Note that in the prediction phase, an image's natural prompt may not be available. In such cases, we leverage an image captioning model, BLIP [16], to generate the prompt for the image.

To attribute a fake image to its source model (RQ2), we propose fake image attribution by training a multi-class classifier (instead of a binary classifier), and we name this classifier an attributor. Specifically, the attributor is trained on fake images generated by multiple text-to-image generation models. Fake images from the same model are labeled as the same class. Moreover, we also establish the attributor with two methods, i.e., image-only and hybrid, which are the same as those of the detector for RQ1.

Different from RQ1 and RQ2, RQ3 focuses on the impact of prompts on the authenticity of generated images. To this end, we conduct prompt analysis from semantic and structural perspectives. For the former, we design two semantic extraction methods to analyze the impact of prompt topics on the authenticity of fake images. More specifically, the first one directly uses the ground truth topics provided in the dataset for each prompt, and the second one automatically clusters the various prompts into different groups and extracts topics from these groups. From the structural perspective, we conduct the study based on the length of prompts and the proportion of nouns in prompts, respectively. Figure 1 presents an overview of our methods to address the three research questions.

Evaluation. We perform experiments on two benchmark prompt-image datasets, MSCOCO [19] and Flickr30k [35], and four popular pre-trained text-to-image generation models, Stable Diffusion [27], Latent Diffusion [27], GLIDE [21], and DALL·E 2 [23].

In fake image detection, extensive experimental results show that image-only detectors can achieve good performance in some cases, while hybrid detectors achieve better performance in all cases. For example, on MSCOCO [19], the image-only detector trained on fake images generated by Stable Diffusion can achieve an accuracy of 0.834 in differentiating fake images generated by Latent Diffusion from the real ones, while it can only achieve 0.613 and 0.554 on GLIDE and DALL·E 2, respectively. Encouragingly, the hybrid detector trained on fake images from Stable Diffusion achieves 0.932/0.899/0.885 accuracy on Latent Diffusion/GLIDE/DALL·E 2 with natural prompts, and 0.945/0.909/0.891 accuracy with BLIP-generated prompts. These results demonstrate that fake images from various models can indeed be distinguished from real images. We further extract a common feature from fake images generated by various models in Section 3.4, which implies the existence of a common artifact shared by fake images across various models.

In fake image attribution, our experiments show that both image-only and hybrid attributors can achieve good performance in all cases. Similarly, the hybrid attributor is better than the image-only one. For instance, the image-only attributor can achieve an accuracy of 0.815 in attributing fake images to the models we consider, while the hybrid attributor can achieve 0.880 with natural prompts and 0.850 with BLIP-generated prompts. These results demonstrate that fake images can indeed be attributed to their corresponding text-to-image generation models. We further show the unique feature extracted from each model in Section 4.4, which implies that different models leave unique fingerprints in the fake images they generate.

In prompt analysis, we first find that prompts with the topics "skis" and "snowboard" tend to generate more authentic images through our first semantic extraction method, which relies on the ground truth information from the dataset. However, by clustering the various prompts over embeddings produced by a sentence transformer [26], we find that prompts with the "person" topic actually generate more authentic images.
Upon further inspection, we discover that most of the images associated with "skis" and "snowboard" are also related to "person." These results indicate that prompts with the topic "person" are more likely to generate authentic fake images. From the structural perspective, our experiments show that prompts with a length between 25 and 75 enable text-to-image generation models to generate fake images with higher authenticity, while the proportion of nouns in the prompt has no significant impact.

Implications. In this paper, we make the first attempt to tackle the threat caused by fake images generated by text-to-image generation models. Our results on detecting fake images and attributing them to their source models are encouraging. This suggests that our solution has the potential to play an essential role in mitigating the threats. We will share our source code with the community to facilitate future research in this field.

2 Preliminaries

2.1 Text-to-Image Generation Models

During the past few months, text-to-image generation models have attracted an increasing amount of attention. A model in this domain normally takes a prompt, i.e., a piece of text, and random noise as input and then denoises the image under the guidance of the prompt so that the generated image matches the description. In this work, we focus on four popular text-to-image generation models that are publicly available online.

• Stable Diffusion [27]. Stable Diffusion (SD) is a diffusion model for text-to-image generation. The available model we use³ is pre-trained on 512×512 images from a subset of the LAION-5B [30] dataset. The CLIP model's [22] text encoder is used to condition the model on prompts.

• Latent Diffusion [27]. Latent Diffusion (LD) is also a diffusion model for text-to-image generation. The available model we use⁴ is pre-trained on LAION-400M, a smaller dataset sampled from LAION-5B [30]. Latent Diffusion also leverages the text encoder of CLIP to guide the direction of fake images.

• GLIDE [21]. GLIDE is a text-to-image generation model proposed by OpenAI. The available model⁵ is trained on a filtered version of a dataset comprising several hundred million prompt-image pairs. In addition, GLIDE is not good at understanding prompts that contain "person" topics, as such images have been removed from the training dataset due to ethical concerns.

• DALL·E 2 [23]. DALL·E 2 is one of the most popular text-to-image generation models proposed by OpenAI. A transformer [32] is used to capture the information from both images and prompts. In our experiments, we adopt a PyTorch version of DALL·E 2 released by DALL·E 2-Pytorch.⁶ DALL·E 2-Pytorch uses an extra layer of indirection with the prior network and is trained on a subset of LAION-5B [30]. Note that we refer to DALL·E 2-Pytorch as DALL·E 2 in the rest of our paper.

3 https://github.com/CompVis/stable-diffusion.
4 https://github.com/CompVis/latent-diffusion.
5 https://github.com/openai/glide-text2im.
6 https://github.com/lucidrains/DALLE2-pytorch.
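All four models can be queried programmatically with dataset captions as prompts. The following is a minimal sketch of such prompt-driven image collection; it assumes the Hugging Face diffusers library and the public "CompVis/stable-diffusion-v1-4" checkpoint, neither of which is prescribed by this paper.

# A minimal sketch (not our exact pipeline): query Stable Diffusion with
# dataset captions to collect fake images. The diffusers library and the
# checkpoint name are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Captions taken from a prompt-image dataset such as MSCOCO.
prompts = ["A man is in a kitchen making pizzas"]

for i, prompt in enumerate(prompts):
    image = pipe(prompt).images[0]  # one 512x512 fake image per prompt
    image.save(f"fake_sd_{i:05d}.png")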
2.2 Datasets

We use the following two benchmark prompt-image datasets to conduct our experiments.

• MSCOCO [19]. MSCOCO is a large-scale object detection, segmentation, and captioning dataset. It is a standard benchmark dataset for evaluating the performance of computer vision models. MSCOCO contains 328,000 images distributed over 80 classes of natural objects, and each image in MSCOCO has several corresponding captions, i.e., prompts. In this work, we consider the first 60,000 prompts due to the constraints of our lab's computational resources.

• Flickr30k [35]. Flickr30k is a widely used dataset for research on image captioning, language understanding, and multimodal learning. It contains 31,783 images and 158,915 English prompts on various scenarios. All images are collected from the Flickr website, and the prompts are written by Flickr users in natural language.

Note that the prompts from these datasets are also important, as they will be used to generate fake images or serve as inputs for the hybrid classifiers (see Section 3 for more details).

In summary, the text-to-image generation models and datasets we consider in this work are listed in Table 1. Since different models are trained on images of different sizes, and fake images usually appear in the real world at different resolutions, we adopt the default settings of these available models and perform experiments on fake images of different sizes.

3 Fake Image Detection

In this section, we present our fake image detector to differentiate fake images from real ones (RQ1). We start by introducing our design goals. Then, we present how to construct the detector. Finally, we show the experimental results.

3.1 Design Goals

To tackle the threats posed by the misuse of various text-to-image generation models, the design of our detector should follow these points.

• Differentiating Between Fake and Real Images. The primary goal of the detector is to effectively differentiate fake images generated by text-to-image generation models from real ones. Successful detection of fake images can reduce the threat posed by the misuse of these advanced models.
Table 1: The text-to-image generation models, datasets, and the number/size of fake images we consider in this work. Note that the number of fake images from DALL·E 2 is low due to its poor image generation efficiency.

Model      Dataset    Images  Image Size
SD         MSCOCO     59,247  512×512
SD         Flickr30k  13,231  512×512
LD         MSCOCO     31,276  256×256
LD         Flickr30k  17,969  256×256
GLIDE      MSCOCO     41,685  256×256
GLIDE      Flickr30k  27,210  256×256
DALL·E 2   MSCOCO      1,028  256×256
DALL·E 2   Flickr30k     300  256×256

Figure 2: An illustration of fake image detection. The red part describes image-only detection. The green part describes hybrid detection. The blue part describes fake images generated by other text-to-image generation models.

• Agnostic to Models and Datasets. As text-to-image generation models have undergone rapid development, it is likely that more advanced models will be proposed in the future. As a result, it is difficult for us to collect all text-to-image generation models to build our detector. Moreover, building the detector on various models (even if we can collect many) inevitably leads to more resource consumption. Therefore, it is crucial to explore whether our detection based on very few text-to-image generation models is generalizable to other models. Also, since we have no knowledge of the distribution of prompts used to generate fake images, it is also important for our detector to identify fake images generated by prompts from other prompt-image datasets.

3.2 Methodology

To achieve the primary goal of differentiating fake images from real ones, we construct a detector by training a binary classifier. Furthermore, to make our detector agnostic to unseen models and datasets, we consider a more realistic and challenging scenario where the detector can collect fake images generated by only one text-to-image generation model given prompts from one dataset. The detector then trains its binary classifier on these fake/real images and evaluates its generalizability on fake images from other models and datasets.

In addition, based on the background knowledge available to the detector, we propose two different approaches to establish the detector, namely image-only and hybrid. The image-only detector accepts only images as input. In contrast, the hybrid detector accepts both images and their corresponding prompts as input. See Figure 2 for an illustration of how to conduct fake image detection.

Image-Only Detection. The red part of Figure 2 shows the pipeline of our image-only detector. The process of training our image-only detector can be divided into three stages, namely, data collection, dataset construction, and detector construction.

• Data Collection. We first randomly sample 20,000 images from MSCOCO and treat them as real images for the next stage. Then, we use the prompts of these 20,000 images to query one model (we choose SD here) to get 20,000 fake images. In this way, our fake images are from one text-to-image generation model given prompts from one dataset, referred to as SD+MSCOCO.

• Dataset Construction. We label all fake images as 0 and all real images as 1. We then create a balanced training dataset containing a total of 40,000 images.

• Detector Construction. We build the detector (i.e., a binary classifier) that accepts an image as input and outputs a binary prediction, i.e., 0-fake or 1-real. Lastly, we train the detector from scratch with fake and real images in conjunction with classical training techniques. Note that we use ResNet18 [14] as our image-only detector's architecture (see the sketch below).

After we have trained the detector, we evaluate the generalizability of the trained detector on images from other models, i.e., LD, GLIDE, and DALL·E 2, given prompts from the other dataset, i.e., Flickr30k. For completeness, we also include the detection results on fake images from the same model and/or the same dataset. Table 1 shows the total number of fake images generated by the four models and two datasets. Besides the 20,000 images (out of 59,247) from SD+MSCOCO, which are used to train the detector, all the others are used to test the performance of the detector. Note that in all cases, we sample the same number of real images as fake ones for training and testing the detector (and the attributor in Section 4).
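A minimal sketch of this training recipe follows; the folder layout, input size, and hyperparameters are illustrative assumptions, not our exact settings.

# A minimal sketch of the image-only detector: ResNet18 trained from scratch
# as a binary classifier (0-fake, 1-real). Paths and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Expects a layout like train/fake/*.png and train/real/*.png; ImageFolder
# sorts class names, so "fake" maps to 0 and "real" to 1.
train_set = datasets.ImageFolder("train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

detector = models.resnet18(weights=None, num_classes=2)  # from scratch
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

detector.train()
for images, labels in loader:          # one epoch shown; repeat as needed
    optimizer.zero_grad()
    loss = criterion(detector(images), labels)
    loss.backward()
    optimizer.step()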
Hybrid Detection. We now present the hybrid detector, which considers both images and their corresponding prompts. This is motivated by the observation that real images always carry a wide range of contents that the prompts cannot fully and faithfully describe. However, since fake images are generated based on prompts, they may not contain additional content beyond what is described, i.e., they are not as informative as real images. Therefore, introducing prompts together with images enlarges the disparity between fake and real images, which in our opinion can contribute to differentiating between the two. We further show in Section 3.4 that the disparity between real and fake images is indeed huge from the prompt's perspective. Note that using prompts as an extra signal for fake image detection is novel and unique to text-to-image generation models, as prompts do not participate in the image generation process of traditional generation models, like GANs.

Figure 3: The performance of the forensic classifier and detectors. We conduct the evaluation on (a) MSCOCO and (b) Flickr30k, respectively.
The green part of Figure 2 shows the pipeline of our hybrid detection. Specifically, the process of training our hybrid detector can also be divided into three stages, i.e., data collection, dataset construction, and detector construction.

• Data Collection. To collect the real and fake images, we follow the same step as the first step for the image-only detector.

• Dataset Construction. Since our hybrid detector takes images and prompts as input, we label all fake images and their corresponding prompts as 0 and label all real images and their corresponding prompts as 1. Similarly, we then create a training dataset containing a total of 40,000 prompt-image pairs.

• Detector Construction. To exploit the prompt information, we take advantage of CLIP's image encoder and text encoder as feature extractors to obtain high-level embeddings of images and prompts. Then, we concatenate the image embeddings and text embeddings together as new embeddings and use these embeddings to train a binary classifier, i.e., a 2-layer multilayer perceptron, as our detector (see the sketch below).

To evaluate the trained hybrid detector, we need both images and their corresponding prompts. Typically, a user may attach a description to an image they post on the Internet. Therefore, we can directly consider this attached description as the prompt for the image. In our experiments, we adopt the original/natural prompts from the dataset to conduct the evaluation.

In a more realistic and challenging scenario where the detector cannot obtain the natural prompts, we propose a simple yet effective method to generate the prompts ourselves. Concretely, we leverage the BLIP [16] model (an image captioning model) to generate captions for the queried images and then regard these generated captions as the prompts for the images.
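A minimal sketch of this hybrid pipeline is given below; the Hugging Face transformers checkpoints for CLIP and BLIP and the MLP sizes are illustrative assumptions rather than our exact configuration.

# A minimal sketch of the hybrid detector: CLIP embeds the image and its
# prompt, the embeddings are concatenated, and a 2-layer MLP classifies the
# pair as fake (0) or real (1). BLIP generates a caption when no natural
# prompt is available. Checkpoint names are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPProcessor)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")
blip_proc = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def caption(image):
    """BLIP fallback when the image's natural prompt is unavailable."""
    inputs = blip_proc(images=image, return_tensors="pt")
    out = blip.generate(**inputs, max_new_tokens=30)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def hybrid_embedding(image, prompt=None):
    """Concatenate CLIP image and text embeddings into one input vector."""
    if prompt is None:
        prompt = caption(image)
    img_in = clip_proc(images=image, return_tensors="pt")
    txt_in = clip_proc(text=[prompt], return_tensors="pt", truncation=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(**img_in)  # shape 1 x 512
        txt_emb = clip.get_text_features(**txt_in)   # shape 1 x 512
    return torch.cat([img_emb, txt_emb], dim=-1)     # shape 1 x 1024

# The 2-layer MLP trained on the concatenated embeddings (hidden size is
# an assumption).
detector = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 2))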
3.3 Results

We now present the performance of our proposed image-only detection and hybrid detection for fake image detection.

Image-Only Detection. For a convincing evaluation, we adopt the existing work [33] on detecting fake images generated by various types of generation models, including GANs and low-level vision models [4, 7], as a baseline. The authors of [33] name their classifier the forensic classifier. Note that this forensic classifier is the state-of-the-art fake image detector for generation models, and the authors show that it has strong generalizability. For instance, it can achieve an accuracy of 0.946 in differentiating fake images generated by StarGAN [6], which is not considered during model training, from real images.

Figure 3 depicts the evaluation results. First of all, we can observe that the forensic classifier cannot effectively distinguish fake images (generated by text-to-image generation models) from real ones. In all cases, the forensic classifier only achieves an accuracy of 0.5, which is equivalent to random guessing. Based on this observation, we can conclude that the forensic classifier does not generalize to text-to-image generation models. We attribute this observation to the differences between traditional generation models and text-to-image generation models. This result also highlights the urgent need for counterpart solutions against the misuse of text-to-image generation models.

Furthermore, we can observe that the image-only detector performs much better in all cases than the forensic classifier. For example, the image-only detector can achieve an accuracy of 0.871 in distinguishing fake images generated by LD+Flickr30k (querying the prompts of Flickr30k to LD) from real images. We emphasize here that the image-only detector is trained only on fake images generated by SD+MSCOCO and has never seen fake images generated by other models given prompts from other datasets. We conjecture that this is due to some common properties shared by all fake images generated by text-to-image generation models (see Section 3.4 for more details).

Lastly, another interesting finding is the much larger variation in detection performance due to the effect of the model compared to the effect of the dataset. For example, in Figure 3a, the image-only detector achieves an accuracy of 0.913 on SD but only 0.526 on DALL·E 2. In contrast, comparing Figure 3a and Figure 3b, the image-only detector achieves very close accuracy on different datasets over all text-to-image generation models. We attribute this observation to the unique fingerprint of fake images generated by text-to-image generation models (see Section 4.4).

Figure 4: The visualization of frequency analysis on (a) real images and (b) fake images.

Figure 5: The probability distribution of the connection between the real/fake images and the corresponding prompts.

Hybrid Detection. Although the image-only detector achieves better performance in all cases compared to the forensic classifier, we acknowledge that the current detection performance is far from the design goal due to the lack of good performance on other models, such as GLIDE and DALL·E 2. As mentioned earlier, using prompts as an extra signal may boost the fake image detection performance.

We report the performance of our proposed hybrid detection in Figure 3. First, we can find that the hybrid detector always achieves much better performance than the image-only detector, especially on models like GLIDE and DALL·E 2. For instance, the hybrid detector with natural prompts can achieve an accuracy of 0.909 on DALL·E 2+MSCOCO, which is much higher than the 0.522 achieved by the image-only detector. Moreover, even without natural prompts, the hybrid detector with BLIP-generated prompts still shows strong performance. For example, on fake images generated by GLIDE+MSCOCO, the hybrid detector with natural prompts achieves an accuracy of 0.891, and encouragingly, the hybrid detector with BLIP-generated prompts also achieves a high accuracy of 0.838. These results indicate that introducing prompts together with images can indeed enlarge the disparity between fake and real images, which is beneficial to fake image detection. We further investigate in more depth why using the prompt as a new signal can improve detection performance (see Section 3.4 for detailed information).

Besides, we can find that the performance of the hybrid detector on other models is much less influenced by prior knowledge of the known model than that of the image-only detector. For example, on the MSCOCO dataset, the hybrid detector with natural prompts can achieve an accuracy of 0.958 on SD, while the accuracy only drops to 0.909 on DALL·E 2. We can also find that the hybrid detector is not influenced much by the dataset, similar to the image-only detector. For instance, on SD, the hybrid detector with generated prompts achieves quite similar accuracy between MSCOCO and Flickr30k (0.930 vs. 0.904). These results show that our proposed hybrid detector is strong regarding model and dataset independence.

3.4 Discussion

The above results fully demonstrate the effectiveness of our fake image detection. Next, we delve more deeply into the reasons for successfully distinguishing fake images from real ones. We conjecture that there exist some common properties shared by fake images from various text-to-image generation models. We verify this conjecture by visualizing the common artifact shared across fake images. Besides, based on the better performance achieved by hybrid detection, we further explore why additional prompt information can enhance detection performance. In the end, we also test whether our trained detector can be directly applied to fake images from other domains, in particular, fake artwork detection.

Artifact Visualization. Inspired by Zhang et al. [39], we draw the frequency spectra of fake and real images. For the four text-to-image generation models we consider in this work, we randomly select 1,000 fake images from each model given prompts from MSCOCO. In total, we have obtained 4,000 fake images. Also, we collect the 4,000 real images of the same prompts from MSCOCO. We then calculate the average of the Fourier transform outputs of real and fake images, respectively. We leverage the Fourier transform here due to its ability to reveal latent features of the given images.

As shown in Figure 4, we can clearly observe distinct patterns in real and fake images. Concretely, the central region of the fake-image spectrum has higher brightness and more concentrated frequency spectra. This observation verifies the existence of the common artifact shared by the fake images generated by various text-to-image generation models.
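The spectrum averaging can be sketched as follows; the grayscale conversion and file paths are assumptions for illustration.

# A minimal sketch of the frequency analysis behind Figure 4: average the
# centered log-magnitude Fourier spectra of a set of same-sized images.
import numpy as np
from PIL import Image

def mean_spectrum(paths):
    acc = None
    for p in paths:
        img = np.asarray(Image.open(p).convert("L"), dtype=np.float32)
        # 2-D FFT shifted so low frequencies sit at the center; the log
        # magnitude makes the bright central artifact visible.
        spec = np.log(np.abs(np.fft.fftshift(np.fft.fft2(img))) + 1e-8)
        acc = spec if acc is None else acc + spec
    return acc / len(paths)

fake_avg = mean_spectrum([f"fake_{i:04d}.png" for i in range(1000)])
real_avg = mean_spectrum([f"real_{i:04d}.png" for i in range(1000)])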
Why Does Prompt Enhance Detection Performance. We conduct a more in-depth study on why using prompts as a new signal can improve detection performance. As mentioned before, a prompt cannot completely reflect the contents of a real image. Meanwhile, a fake image is purely based on the prompt information. This suggests that the connection between a fake image and its prompt is stronger than the connection between a real image and its prompt. This is essentially the reason why the hybrid detector performs better than the image-only detector.

To verify this, we first randomly sample 2,000 prompts from MSCOCO. For each prompt, we collect its corresponding real image from the dataset and let SD generate a fake image for it. Then, we rely on CLIP's text encoder to transfer the prompt to an embedding and CLIP's image encoder to transfer the real and fake images to two embeddings, respectively. Then, we calculate two cosine similarities, one between the prompt's embedding and its real image's embedding, and the other between the prompt's embedding and its fake image's embedding. Finally, the two cosine similarities are normalized into a probability distribution via a softmax function [22]. A higher probability implies a stronger connection between the image and the prompt. Figure 5 shows the similarity distribution between the 2,000 prompts and the real/fake images. We can see that the similarity between the fake image and the corresponding prompt is higher than that between the real image and the same prompt, leading to a clear gap in the similarity distribution between fake and real images. This verifies our aforementioned intuition.

Furthermore, we can also conclude that it is not the prompt information itself that enhances the performance of the detector; rather, the prompt information can be exploited as an extra "anchor" to provide a new signal to distinguish between real and fake images. Such signals can be effectively captured by a multilayer perceptron.
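This similarity computation can be sketched as follows, reusing the CLIP model and processor from the hybrid-detector sketch above; the function name is ours, not a fixed API.

# A minimal sketch of the prompt-"anchor" analysis: cosine similarities
# between one prompt and its real vs. fake image, softmax-normalized into
# the probabilities plotted in Figure 5.
import torch
import torch.nn.functional as F

def prompt_affinity(prompt, real_image, fake_image):
    with torch.no_grad():
        txt = clip.get_text_features(
            **clip_proc(text=[prompt], return_tensors="pt", truncation=True))
        real = clip.get_image_features(
            **clip_proc(images=real_image, return_tensors="pt"))
        fake = clip.get_image_features(
            **clip_proc(images=fake_image, return_tensors="pt"))
    sims = torch.stack([F.cosine_similarity(txt, real),
                        F.cosine_similarity(txt, fake)])
    # Higher probability implies a stronger image-prompt connection.
    return F.softmax(sims, dim=0)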
Case Study of Artwork. So far, all the previous fake images we have studied are related to MSCOCO and Flickr30k, which are about natural objects. The experimental results show that our proposed fake image detection can achieve excellent performance in differentiating these fake images from the real ones. However, text-to-image generation models can also be used to generate other types of images, especially fake artwork. Therefore, it is interesting to see whether our proposed fake image detection can be directly used to distinguish between real and fake artworks.

Since there do not exist many datasets on artworks, we collect 50 real artworks and 50 fake artworks generated by SD from the Internet. Besides, since there are no corresponding prompts for these collected artworks, we adopt the image-only detector and the hybrid detector with prompts generated by BLIP to conduct the evaluation. Note that both detectors we adopt have been trained in previous experiments based on SD+MSCOCO. The experiments show that our proposed detectors can still achieve good performance in differentiating fake artworks from real ones. For instance, the image-only detector achieves an accuracy of 0.710, and the hybrid detector achieves an accuracy of 0.690. The results, again, indicate that fake images of different styles (e.g., artworks and natural objects) generated by text-to-image generation models share common properties.
3.5 Ablation Study

Impact of Generated Prompts. In hybrid detection with generated prompts, we rely on the BLIP model. Here, we explore whether the quality of the BLIP-generated prompts affects the detection performance. To measure the quality of the prompts generated by BLIP, we leverage a term called prompt descriptiveness [9, 10, 20, 29]. Prompt descriptiveness can be quantitatively measured by computing the cosine similarity between a prompt's embedding and its image's embedding generated by CLIP.⁷ Such similarity demonstrates the degree of match between the generated prompts and the given images. Figure 6 depicts the relation between the detection performance and the descriptiveness of the generated prompts. We can see that, in general, higher descriptiveness leads to better detection performance. Also, after a certain descriptiveness value, the detection performance becomes stable across all models and datasets. This shows the robustness of using BLIP-generated prompts in our hybrid detector.

Figure 6: The performance of hybrid detectors with generated prompts in terms of the prompts' descriptiveness. The descriptiveness is grouped into five equally sized bins.

Impact of Training Dataset Size. In this section, we explore the impact of the training dataset's size on the performance of our proposed fake image detection. More concretely, for each text-to-image generation model, we train the detector on fake images from SD+MSCOCO while varying the size of the training dataset from 500 to 40,000 (half real, half fake). Note that the default size we use in the previous evaluation is 40,000.

We report the detection performance in terms of the training dataset size in Figure 7. As expected, the performance of the different detectors is indeed affected by the size of the training dataset, and the general trend is that all the detectors perform better as the training dataset grows. For instance, as shown in Figure 7b, when the training dataset size is 1,000, the hybrid detector can achieve an accuracy of 0.792, while the accuracy can be improved to 0.885 when the training dataset size is 40,000. More encouragingly, we can also find that the hybrid detector achieves strong performance even with a small training dataset of only 500 images, which is much fewer than 40,000 images. For example, in Figure 7a, the hybrid detector achieves a high accuracy of 0.830 with only 500 training images. Finally, we again find that the hybrid detector performs much better than the image-only detector across different training dataset sizes. For example, in Figure 7c, the hybrid detector achieves an accuracy of 0.714 with 500 training images, while the image-only detector achieves only 0.523 with 40,000 training images. These results again demonstrate that introducing prompts together with images is beneficial for differentiating between fake and real images.

7 Note that the descriptiveness is the same as the one used in the previous analysis regarding why prompts can enhance detection performance.

Figure 7: The performance of detectors in terms of the training dataset size on SD+MSCOCO. We conduct the evaluation on (a) SD+MSCOCO, (b) LD+MSCOCO, (c) GLIDE+MSCOCO, and (d) DALL·E 2+MSCOCO.

3.6 Takeaways

In summary, to answer RQ1, we propose fake image detection by training a binary detector to differentiate fake images generated by text-to-image generation models from real images. Specifically, we propose two methods to construct the binary detector, namely image-only and hybrid. Our evaluation shows that fake images from various models can indeed be differentiated from real ones. Moreover, the hybrid detector obtains much better performance than the image-only detector, which demonstrates that introducing prompts together with images can indeed amplify the differences between fake and real images.

4 Fake Image Attribution

The previous section has shown that fake image detection, especially the hybrid detection we have proposed, can achieve remarkable performance. In this section, we explore whether fake images generated by various text-to-image generation models can be attributed to their source models, i.e., fake image attribution. We start by introducing our design goals. We then describe how to construct the fake image attributor. Finally, we present the evaluation results.

4.1 Design Goals

To attribute fake images to their source models, we follow two design goals.

• Tracking Sources of Fake Images. The primary goal of fake image attribution is to effectively attribute different fake images to their source generation models. The aim of attribution is to let a model owner be held responsible for the model's (possible) misuse. Previously, fake image attribution has been studied in the context of traditional generation models, like GANs [36].

• Agnostic to Datasets. In the real world, a fake image can be generated by a text-to-image generation model based on a prompt from any distribution. Therefore, to be more practical, the attribution should be independent of the prompt distribution.

4.2 Methodology

To attribute fake images to their sources, we construct fake image attribution by training a multi-class classifier, referred to as an attributor, with each class corresponding to one model. As aforementioned, the attributor should be agnostic to datasets; thus, we establish the multi-class classifier based on prompts from only one dataset, e.g., MSCOCO, and test it on prompts from other datasets, like Flickr30k.

Similar to fake image detection, we propose two different approaches to establish the attributor, namely image-only and hybrid. The image-only attributor accepts only images as input, and the hybrid attributor accepts both images and their corresponding prompts as input.

Image-Only Attribution. The process of establishing our image-only attributor can also be divided into three stages, namely, data collection, dataset construction, and attributor construction.

• Data Collection. We first randomly sample 20,000 images from MSCOCO as real images. Then, we use the prompts of these 20,000 images to query each model to get 20,000 fake images accordingly. Here, we adopt SD, LD, and GLIDE to generate fake images. In total, we have obtained 60,000 fake images. The reason we do not consider DALL·E 2 is that we will use DALL·E 2 for the experiments regarding adaptation to other models (see Section 4.4).

• Dataset Construction. We label all real images as 0 and all fake images from the same model as the same class. Concretely, we label the fake images from SD/LD/GLIDE as 1/2/3. We then create a training dataset containing a total of 80,000 images with four classes.

• Attributor Construction. We build the fake image attributor, i.e., a multi-class classifier, that accepts images as input and outputs a multi-class prediction, i.e., 0-real, 1-SD, 2-LD, or 3-GLIDE. We train the attributor from scratch using the created training dataset in conjunction with classical training techniques. Similar to the fake image detector, we leverage ResNet18 [14] as the attributor's architecture.

After we have trained the attributor, we evaluate the performance of the trained attributor in attributing images from various sources (i.e., real, SD, LD, and GLIDE) given prompts from the other dataset (i.e., Flickr30k). For testing the attributor, we sample the same number of images for all four classes, i.e., 10,000 each and 40,000 in total.
Hybrid Attribution. The previous evaluation in fake image detection has demonstrated the superior performance of the hybrid detector, verifying that introducing prompts together with images can amplify the differences between fake and real images. We now conduct a study to investigate whether a similar enhancement can be observed in the case of hybrid attribution.

The hybrid attributor is quite similar to the above image-only attributor and also consists of three stages, i.e., data collection, dataset construction, and attributor construction.

• Data Collection. To collect the images from various sources, we follow the same step as the first step for the image-only attributor.

• Dataset Construction. Since our hybrid attributor takes images and prompts as input, we label all real images with their corresponding prompts as 0 and all fake images from the same model with their corresponding prompts as the same class. Similarly, we then create a training dataset containing a total of 80,000 prompt-image pairs with four classes.

• Attributor Construction. To exploit the prompt information, we again use CLIP's image encoder and text encoder as feature extractors to obtain high-level embeddings of images and prompts. Then, we concatenate the image embeddings and text embeddings together as new embeddings and use these embeddings to train a multi-class classifier, which is also a 2-layer multilayer perceptron, as our attributor.

In order to evaluate the trained hybrid attributor, we need images and their corresponding prompts. We again consider two scenarios here: one in which we can directly obtain prompts for the images from the dataset, and the other in which we can only generate prompts for the images relying on BLIP.
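In either variant, the attributor differs from the corresponding detector only in its output head. A minimal sketch under the label mapping above (0-real, 1-SD, 2-LD, 3-GLIDE); the helper name is ours:

# A minimal sketch of the attributor construction: the detector recipe with
# a 4-way head instead of a binary one.
import torch
import torch.nn as nn
from torchvision import models

CLASSES = {0: "real", 1: "SD", 2: "LD", 3: "GLIDE"}

attributor = models.resnet18(weights=None, num_classes=len(CLASSES))
criterion = nn.CrossEntropyLoss()  # trained exactly like the binary detector

def attribute(batch):
    """Return the predicted source index for each image in the batch."""
    with torch.no_grad():
        return attributor(batch).argmax(dim=-1)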
Table 2: The performance of image-only attributors and hybrid attributors.

                            MSCOCO  Flickr30k
Image-Only                   0.864      0.863
Hybrid (natural prompts)     0.936      0.933
Hybrid (generated prompts)   0.903      0.892

4.3 Results

In this section, we present the performance of our proposed two types of fake image attribution.

Image-Only Attribution. We report the performance of image-only attribution in Table 2. Note that random guessing for the 4-class classification task achieves only 0.25. We can find that our proposed image-only attributor achieves remarkable performance. For instance, the image-only attributor can achieve an accuracy of 0.864 on images from various sources given prompts sampled from MSCOCO. These results indicate that fake images can be effectively attributed to their corresponding text-to-image generation models. We further show the unique feature extracted from each model in Section 4.4, which implies that different models may leave unique fingerprints in the fake images they generate.

Further, the image-only attributor can also achieve a high accuracy of 0.863 on images from various source models given prompts sampled from the other dataset, Flickr30k. Note that we construct the attributor based on MSCOCO only. This result indicates that our proposed image-only attribution is agnostic to datasets.

Hybrid Attribution. Table 2 also depicts the performance of our proposed hybrid attribution. We can clearly see that hybrid attribution, whether with or without natural prompts, achieves better performance than image-only attribution regardless of the dataset. These results demonstrate once again that fake images can be successfully attributed to their corresponding text-to-image generation models. Also, they verify that using prompts as an extra signal can improve attribution performance.

4.4 Discussion

The above evaluation demonstrates the effectiveness of our fake image attribution. We conjecture that each text-to-image generation model leaves a unique fingerprint in the fake images it generates. Next, we verify this conjecture by visualizing the fingerprints of different models. Besides, in the previous evaluation, the training and testing images for our attributor are disjoint but generated by the same set of text-to-image generation models. We further explore how to adapt our attributor to other models that are not considered during training.

Fingerprint Visualization. Similar to visualizing the shared artifact across fake images (see Section 3.4), we also draw the frequency spectra of the different text-to-image generation models on MSCOCO. For each text-to-image generation model, we randomly select 2,000 fake images and then calculate the average of their Fourier transform outputs.

As shown in Figure 8, we can clearly observe distinct patterns in the images generated by different text-to-image generation models, especially in GLIDE and DALL·E 2. We can also find that the frequency spectrum of SD is similar to that of LD, which can explain why the image-only detector built on SD can also achieve very strong performance on LD (see Figure 3). The reason behind this is that SD and LD follow similar algorithms, although they are trained on different datasets with different model architectures. In conclusion, the qualitative evaluation verifies that each text-to-image generation model has its unique fingerprint.
Figure 8: The visualization of frequency analysis on fake images generated by (a) SD, (b) LD, (c) GLIDE, and (d) DALL·E 2.

Figure 9: The performance of attributors on the unseen model DALL·E 2 in terms of different thresholds. We conduct the evaluation on (a) MSCOCO and (b) Flickr30k.

Figure 10: The performance of attributors in terms of the training dataset size on MSCOCO. We conduct the evaluation on (a) MSCOCO and (b) Flickr30k.

Adaptation to Unseen Models. In previous experiments, we evaluate attribution on fake images generated by models considered during training. However, there are instances where we encounter fake images that are not from the models involved in training, i.e., unseen models. Next, we explore how to adapt our attributor to such unseen models.

To this end, we propose a simple yet effective approach named confidence-based attribution. The key idea is to attribute unconfident samples, i.e., those whose attributor prediction confidence is lower than a pre-defined threshold, to unseen models. Here, all unseen models are considered as one class.⁸ In our evaluation, we treat DALL·E 2 as the unseen model (as mentioned before). To find a suitable threshold, we have experimented with values from 0 to 1 in steps of 0.1. Note that here we extend the evaluation from four classes to five: real, SD, LD, GLIDE, and unseen; the testing dataset is still balanced. Also, the attributor remains unchanged, i.e., it is still a 4-class classifier. Figure 9 shows that both image-only and hybrid attributors can achieve good performance in all cases. Encouragingly, the 0.9 threshold leads to the best attribution performance. Moreover, we can still conclude that hybrid attribution achieves better performance than image-only attribution in both settings. These results indicate that, with a simple modification, our attribution can be adapted to unseen models.

8 In the current version, our approach cannot differentiate fake images from multiple unseen text-to-image generation models. We will leave this as future work.
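The rule itself is a few lines of post-processing on the attributor's output; the helper name is ours, and the trained 4-class attributor stays untouched.

# A minimal sketch of confidence-based attribution: predictions whose top
# softmax probability falls below a threshold (0.9 works best in Figure 9)
# are re-assigned to a fifth "unseen model" class.
import torch
import torch.nn.functional as F

UNSEEN = 4  # extra label for models not seen during training

def confidence_attribute(logits, threshold=0.9):
    probs = F.softmax(logits, dim=-1)   # batch x 4
    conf, pred = probs.max(dim=-1)
    pred[conf < threshold] = UNSEEN     # unconfident samples -> unseen model
    return pred

# Usage: labels = confidence_attribute(attributor(images))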
4.5 Ablation Study

Impact of Training Dataset Size. Here, we explore the effect of the training dataset size on attribution performance. The experimental results are depicted in Figure 10. We can see that the size of the training data indeed has a great influence on attribution performance. For example, when the training dataset size is 5,000, the hybrid attributor achieves an accuracy of 0.736, while the accuracy can be improved to 0.946 when the training dataset size increases to 80,000. Besides, we can find that the hybrid attributor requires less data to achieve stable performance compared to the image-only attributor. For example, hybrid attribution achieves a huge performance improvement from 10,000 to 20,000 in training dataset size, while for image-only attribution, a similar improvement happens when the training dataset size increases from 40,000 to 80,000. From this phenomenon, we can conclude that hybrid attribution achieves good performance even with a small amount of training data.

4.6 Takeaways

In summary, to answer RQ2, we propose image-only attribution and hybrid attribution to track the sources of fake images. Empirical results indicate that fake images can be successfully attributed to their sources. We further conduct a qualitative analysis that verifies the existence of unique fingerprints left by different text-to-image generation models in their generated images. Also, we show that our method can be easily adapted to unseen models.

5 Prompt Analysis

One of the major differences between text-to-image generation models and traditional generation models is that the former take a prompt as input. In this section, we investigate which kinds of prompts are more likely to generate authentic images (RQ3). To answer this question, we perform a comprehensive prompt analysis from semantic and structural perspectives.
Table 3: Top five prompts which can generate the most real or fake images, determined by the image-only detector. Gray cells in Real mean the prompt mainly describes the details of the subject. Gray cells in Fake mean the prompt mainly describes the environment where the subject is located.

Rank  Real                                                          Fake
Top1  A dog hanging out of a side window on a car                   A green bus is parked on the side of the street
Top2  A pan filled with food sitting on a stove top                 THERE IS A ZEBRA THAT IS EATING GRASS IN THE YARD
Top3  A birthday cake with English and Chinese characters           I sign that indicates the street name posted above a stop sign
Top4  There is an elephant-shaped figure next to other decorations  A group of skiers as they ski on the snow
Top5  there is a cake and donuts that look like a train             A bench is surrounded by grass and a few flowers
Figure 11: The top twenty topics of prompts in terms of the proportion of the corresponding generated fake images being classified as real by the image-only detector. The topics are extracted from the MSCOCO dataset. (The twenty topics shown are: skis, snowboard, kite, bird, cat, zebra, sink, giraffe, sheep, chair, toilet, bottle, bowl, person, airplane, suitcase, banana, teddy bear, umbrella, and frisbee.)

Figure 12: Examples of fake images generated by SD given prompts with the topics (a) "skis" and (b) "snowboard."

5.1 Semantics Analysis

We first conduct a semantic analysis of prompts based on their topics. Concretely, we group prompts into different clusters by topic. Then, for each cluster/topic, we calculate the proportion of the corresponding fake images that are classified as real images by our image-only detector (Section 3). A cluster with a higher proportion indicates that prompts with the underlying topic have a higher chance of generating authentic images. As we focus on the authenticity of an image itself, we adopt the image-only detector instead of the hybrid detector. Note that our analysis is conducted on fake images generated by SD given prompts from MSCOCO.

We first utilize a straightforward method to group prompts relying on the topics provided by MSCOCO. In total, there are 80 topics in MSCOCO. We select the top twenty topics with the highest real-image proportion decided by the image-only detector and report the results in Figure 11. We can clearly observe that among the top twenty topics, "skis" and "snowboard" are ranked the highest. Also, there are many topics related to animals, such as "sheep," "cat," and "zebra."

Though the topics from MSCOCO are straightforward, they may not be able to represent the full semantics of the images. Therefore, we take another approach. Specifically, we take advantage of a sentence transformer [26] based on BERT [8] to generate embeddings for the prompts and then
The advantage of this second approach is that it can implicitly reflect the in-depth semantics of the prompts, which is also a common practice in the natural language processing literature. Its disadvantage is that the concrete topic of each cluster needs to be summarized manually. By manual inspection, the cluster with the highest real image proportion is related to the topic “person.” Ostensibly, this differs from the results of the first approach (“skis” and “snowboard” ranked the highest), which is based on the topics provided by MSCOCO. However, by manually checking the fake images generated by prompts with the topics “skis” and “snowboard,” we discover that most of them depict a “person” as well. We show some examples in Figure 12. This indicates that prompts related to “person” are likely to generate authentic fake images.
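For illustration, the following is a minimal sketch of this clustering pipeline; the sentence-transformer checkpoint, the DBSCAN parameters, and the predicts_real wrapper around the image-only detector are illustrative assumptions rather than our exact configuration.

# A minimal sketch of the second clustering approach (illustrative settings).
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

encoder = SentenceTransformer("bert-base-nli-mean-tokens")  # BERT-based encoder [26]

def cluster_authenticity(prompts, predicts_real, eps=0.5, min_samples=5):
    """predicts_real: hypothetical wrapper around the image-only detector that
    returns True if the fake image generated from a prompt is classified as real."""
    embeddings = encoder.encode(prompts)                # one embedding per prompt
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)

    stats = defaultdict(lambda: [0, 0])                 # cluster -> [real, total]
    for prompt, label in zip(prompts, labels):
        if label == -1:                                 # skip DBSCAN noise points
            continue
        stats[label][0] += int(predicts_real(prompt))
        stats[label][1] += 1
    # Proportion of fake images classified as real, per cluster.
    return {c: real / total for c, (real, total) in stats.items()}

DBSCAN is convenient here because the number of clusters does not need to be fixed in advance, at the cost of tuning eps; the trade-off is that each resulting cluster’s topic still has to be summarized by hand.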
We further extract the top five prompts that generate the most real- and fake-looking images, respectively, according to the image-only detector, and list them in Table 3. We find that detailed descriptions of the subjects contribute to the generation of authentic images. For example, of the top five real prompts, four provide a detailed description of the subject, while four of the top five fake prompts describe the environment where the subject is located rather than the subject itself. In the future, we plan to investigate in depth the relationship between the prompts’ semantics and the generated images’ authenticity.
5.2 Structure Analysis

After the semantic analysis, we now conduct a structure analysis. Specifically, we study prompt structure from two angles, i.e., the length of the prompt and the proportion of nouns in it. The length of a prompt reflects its complexity, while the proportion of nouns relates to the number of objects appearing in the fake image. Here, we use the Natural Language Toolkit (NLTK) [2] to compute the proportion of nouns in a prompt.
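A minimal sketch of this computation with NLTK is shown below; the tokenizer and tagger are NLTK’s defaults, and treating every tag starting with “NN” as a noun is our simplifying assumption.

# A minimal sketch for computing the proportion of nouns in a prompt with NLTK.
import nltk

nltk.download("punkt", quiet=True)                       # default tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # default POS tagger

def noun_proportion(prompt: str) -> float:
    tokens = nltk.word_tokenize(prompt)
    if not tokens:
        return 0.0
    tagged = nltk.pos_tag(tokens)                        # Penn Treebank tags
    nouns = [tag for _, tag in tagged if tag.startswith("NN")]  # NN, NNS, NNP, NNPS
    return len(nouns) / len(tokens)

print(noun_proportion("A man riding skis down a snow covered slope"))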
In our experiments, we randomly select 5,000 prompts from MSCOCO and feed them to SD to generate fake images. Results are shown in Figure 13. We can see from Figure 13a that both extremely long and extremely short prompts fail to generate authentic images; almost all high-authenticity images are generated by prompts with lengths between 25 and 75. On the other hand, Figure 13b shows that the proportion of nouns in a prompt does not have a significant impact on the fake images’ authenticity.

Figure 13: The relationship between the length/proportion of nouns in a prompt and the corresponding image’s authenticity. (a) Length vs. authenticity; (b) proportion of nouns vs. authenticity.
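The aggregation behind Figure 13a can be reproduced with a sketch like the following; authenticity_score is a hypothetical helper returning the image-only detector’s probability that the generated image is real, and treating length as a character count is our assumption.

# A minimal sketch of the length analysis behind Figure 13a (assumed helpers).
import numpy as np

def length_vs_authenticity(prompts, authenticity_score, bin_width=25):
    lengths = np.array([len(p) for p in prompts])      # length as character count
    scores = np.array([authenticity_score(p) for p in prompts])
    result = {}
    for b in np.unique(lengths // bin_width):          # fixed-width length bins
        mask = (lengths // bin_width) == b
        result[int(b) * bin_width] = float(scores[mask].mean())
    return result                                      # bin start -> mean authenticity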
5.3 Takeaways

In summary, we conduct a semantic analysis and a structure analysis to study which types of prompts are more likely to drive text-to-image generation models to generate fake images with high authenticity. Empirical results demonstrate that a prompt with the topic “person” or a length between 25 and 75 is more likely to produce authentic images, thus making detection by our designed detectors more difficult.
6 Related Work

6.1 Text-to-Image Generation

Typically, text-to-image generation takes a text description (i.e., a prompt) as input and outputs an image that matches the description. Some pioneering works on text-to-image generation [25, 38] are based on GANs [13]. By combining a prompt embedding with a latent vector, the authors expect the GAN to generate an image depicting the prompt. These works have stimulated more researchers [3, 15, 18, 31, 34, 37] to study GAN-based text-to-image generation models, but GANs do not always achieve good generation performance [24, 27].

Recently, text-to-image generation has made great progress with the emergence of diffusion models [1, 21, 27, 28]. Models in this domain normally take random noise and a prompt as input and iteratively denoise the noisy image into a clear one under the guidance of the prompt. Currently, text-to-image generation based on diffusion models, such as DALL·E [24], Stable Diffusion [27], Imagen [28], GLIDE [21], and DALL·E 2 [23], has achieved state-of-the-art performance compared to previous works. This is also the reason why we focus on such models in this work.
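To make the prompt-guided denoising idea concrete, the following is a purely schematic sketch of classifier-free-guided sampling; denoiser, the toy noise schedule, and the update rule are placeholders, not the API or sampler of any of the above models.

# A schematic sketch of prompt-guided diffusion sampling (placeholder math).
import torch

@torch.no_grad()
def sample(denoiser, prompt_emb, steps=50, guidance=7.5, shape=(1, 3, 64, 64)):
    """denoiser(x, t, cond) -> predicted noise; a stand-in for the real model."""
    x = torch.randn(shape)                        # start from pure Gaussian noise
    for t in reversed(range(1, steps + 1)):
        eps_cond = denoiser(x, t, prompt_emb)     # prompt-conditioned prediction
        eps_uncond = denoiser(x, t, None)         # unconditional prediction
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)  # classifier-free guidance
        alpha = 1.0 - t / (steps + 1)             # toy schedule, not a real one
        x = (x - (1.0 - alpha) * eps) / max(alpha ** 0.5, 1e-3)  # schematic update
    return x                                      # a clear image matching the prompt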
6.2 Fake Image Detection and Attribution

Wang et al. [33] find that a simple CNN model can easily distinguish fake images generated by various types of traditional generation models (e.g., GANs [13] and low-level vision models [4, 7]) from real images. The authors argue that these fake images share common defects that set them apart from real images. Yu et al. [36] demonstrate that fake images generated by various traditional generation models can be attributed to their sources, revealing that these models leave fingerprints in the generated images. Girish et al. [12] further propose a new attribution method for the open-world scenario, where the detector has no knowledge of the generation model.

We emphasize that almost all existing works focus only on traditional generation models, such as GANs [13], low-level vision models [4, 7], and perceptual-loss generation models [5, 17]. Detecting and attributing fake images generated by text-to-image generation models is largely unexplored. In this work, we take the first step toward studying this problem systematically.
7 Conclusion

In this paper, we delve into three research questions concerning the detection and attribution of fake images generated by text-to-image generation models. To answer the first research question, i.e., whether we can distinguish fake images from real ones, we propose fake image detection. Our fake image detection consists of two types of detectors: an image-only detector and a hybrid detector. The image-only detector uses only the image as input to identify fake images, while the hybrid detector leverages both the image and the corresponding prompt. In the testing phase, if the hybrid detector cannot obtain the natural prompt of an image, we take advantage of BLIP, an image captioning model, to generate a prompt for the image. Our extensive experiments show that while an image-only detector can achieve strong performance on certain text-to-image generation models, a hybrid detector always performs better. These results demonstrate that fake images generated by different text-to-image generation models share common features and that prompts can serve as an extra “anchor” to help the detector better differentiate between fake and real images.
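As an illustration of this captioning fallback, a sketch using the public BLIP captioning checkpoint [16] via the Hugging Face transformers API might look as follows; the checkpoint name and the way the caption is consumed downstream are assumptions rather than our exact pipeline.

# A minimal sketch of the captioning fallback: when no natural prompt is
# available, BLIP [16] generates one from the image itself.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output[0], skip_special_tokens=True)

# The generated caption then plays the role of the prompt for the hybrid detector.
print(caption("suspicious_image.png"))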
To tackle the second research question, we perform fake image attribution, attributing fake images generated by different text-to-image generation models to their source models. Similarly, we develop two types of multi-class classifiers: an image-only attributor and a hybrid attributor. Empirical results show that both the image-only attributor and the hybrid attributor perform well in all cases. This implies that fake images generated by different text-to-image generation models possess distinct properties, which can be viewed as fingerprints.
Finally, we address the third research question, i.e., which kinds of prompts are more likely to generate authentic images. We study the properties of prompts from semantic and structural perspectives. From the semantic perspective, we show that prompts with the topic “person” yield more authentic fake images than prompts with other topics. From the structural perspective, our experiments reveal that prompts with lengths ranging from 25 to 75 allow text-to-image generation models to create more authentic fake images.
Overall, this work presents the first comprehensive study of detecting and attributing fake images generated by state-of-the-art text-to-image generation models. As our empirical results are encouraging, we believe our detectors and attributors can play an essential role in mitigating the threats posed by fake images created by advanced generation models. We will share our code to facilitate future research in this field.
References

[1] James Atwood and Don Towsley. Diffusion-Convolutional Neural Networks. In Annual Conference on Neural Information Processing Systems (NIPS), pages 1993–2001. NIPS, 2016.
[2] Steven Bird and Edward Loper. NLTK: The Natural Language Toolkit. In Annual Meeting of the Association for Computational Linguistics (ACL). ACL, 2004.
[3] Navaneeth Bodla, Gang Hua, and Rama Chellappa. Semi-supervised FusedGAN for Conditional Image Generation. In European Conference on Computer Vision (ECCV), pages 689–704. Springer, 2018.
[4] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to See in the Dark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3291–3300. IEEE, 2018.
[5] Qifeng Chen and Vladlen Koltun. Photographic Image Synthesis with Cascaded Refinement Networks. In IEEE International Conference on Computer Vision (ICCV), pages 1520–1529. IEEE, 2017.
[6] Yunjey Choi, Min-Je Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8789–8797. IEEE, 2018.
[7] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-Order Attention Network for Single Image Super-Resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11065–11074. IEEE, 2019.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186. ACL, 2019.
[9] Pierre L. Dognin, Igor Melnyk, Youssef Mroueh, Inkit Padhi, Mattia Rigotti, Jarret Ross, Yair Schiff, Richard A. Young, and Brian Belgodere. Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge. Journal of Artificial Intelligence Research, 2022.
[10] Elisa Kreiss, Noah D. Goodman, and Christopher Potts. Concadia: Tackling Image Accessibility with Descriptive Texts and Context. CoRR abs/2104.08376, 2021.
[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231. AAAI, 1996.
[12] Sharath Girish, Saksham Suri, Sai Saketh Rambhatla, and Abhinav Shrivastava. Towards Discovery and Attribution of Open-World GAN Generated Images. In IEEE International Conference on Computer Vision (ICCV), pages 14094–14103. IEEE, 2021.
[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Annual Conference on Neural Information Processing Systems (NIPS), pages 2672–2680. NIPS, 2014.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE, 2016.
[15] Qicheng Lao, Mohammad Havaei, Ahmad Pesaranghader, Francis Dutil, Lisa Di-Jorio, and Thomas Fevens. Dual Adversarial Inference for Text-to-Image Synthesis. In IEEE International Conference on Computer Vision (ICCV), pages 7566–7575. IEEE, 2019.
[16] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. CoRR abs/2201.12086, 2022.
[17] Ke Li, Tianhao Zhang, and Jitendra Malik. Diverse Image Synthesis From Semantic Layouts via Conditional IMLE. In IEEE International Conference on Computer Vision (ICCV), pages 4219–4228. IEEE, 2019.
[18] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-Driven Text-To-Image Synthesis via Adversarial Training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 12174–12182. IEEE, 2019.
[19] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
[20] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics. Journal of Artificial Intelligence Research, 2013.
[21] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. CoRR abs/2112.10741, 2021.
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
[23] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. CoRR abs/2204.06125, 2022.
[24] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. In International Conference on Machine Learning (ICML), pages 8821–8831. JMLR, 2021.
[25] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning (ICML), pages 1060–1069. JMLR, 2016.
[26] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3980–3990. ACL, 2019.
[27] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695. IEEE, 2022.
[28] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. CoRR abs/2205.11487, 2022.
[29] Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning. CoRR abs/2207.07635, 2022.
[30] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. CoRR abs/2111.02114, 2021.
[31] Douglas M. Souza, Jonatas Wehrmann, and Duncan D. Ruiz. Efficient Neural Architecture for Text-to-Image Synthesis. In International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Annual Conference on Neural Information Processing Systems (NIPS), pages 5998–6008. NIPS, 2017.
[33] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-Generated Images Are Surprisingly Easy to Spot... for Now. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8692–8701. IEEE, 2020.
[34] Zixu Wang, Zhe Quan, Zhi-Jie Wang, Xinjian Hu, and Yangyang Chen. Text to Image Synthesis With Bidirectional Generative Adversarial Network. In International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020.
[35] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014.
[36] Ning Yu, Larry Davis, and Mario Fritz. Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints. In IEEE International Conference on Computer Vision (ICCV), pages 7555–7565. IEEE, 2019.
[37] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-Modal Contrastive Learning for Text-to-Image Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 833–842. IEEE, 2021.
[38] Han Zhang, Tao Xu, and Hongsheng Li. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV), pages 5908–5916. IEEE, 2017.
[39] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and Simulating Artifacts in GAN Fake Images. In IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2019.