Individualized Deepfake Detection Exploiting Traces Due to Double Neural-Network Operations
Abstract
In today’s digital landscape, journalists urgently require tools to verify the authenticity of facial images and videos depicting specific public figures before incorporating them into news stories. Existing deepfake detectors are not optimized for this detection task when an image is associated with a specific and identifiable individual. This study focuses on the deepfake detection of facial images of individual public figures. We propose to condition the proposed detector on the identity of the identified individual given the advantages revealed by our theory-driven simulations. While most detectors in the literature rely on perceptible or imperceptible artifacts present in deepfake facial images, we demonstrate that the detection performance can be improved by exploiting the idempotency property of neural networks. In our approach, the training process involves double neural-network operations where we pass an authentic image through a deepfake simulating network twice. Experimental results show that the proposed method improves the area under the curve (AUC) from 0.92 to 0.94 and reduces its standard deviation by 17%. For evaluating the detection performance of individual public figures, a facial image dataset with individuals’ names is required, a criterion not met by the current deepfake datasets. To address this, we curated a dataset comprising 32k images featuring 45 public figures, which we intend to release to the public after the paper is published.
1 Introduction
A deepfake refers to a seemingly authentic image or video generated by a deep neural network. When it comes to human faces, a manipulation method may comprise reenactment, replacement, editing, and synthesis [41]. While deepfakes can facilitate numerous appealing and advantageous applications, the act of replacing the face in a staged image or video with the face of a public figure can pose a serious threat to society. Given the continuous influx of deepfake videos on public platforms, journalists need to pay special attention to those that relate to significant public interest, such as those featuring celebrities or politicians [41, 26]. Deepfake generation methods have evolved from autoencoder-based approaches [1] to GANs [6] and diffusion models [47]. The latest diffusion-based models such as [47, 49] can surpass GAN-based models in producing photorealistic images. Nevertheless, even in the present day, autoencoder-based models remain threatening in terms of malicious use. This is due to the availability of several free, downloadable, and user-friendly applications built on autoencoders, such as FaceSwap [5], Faceswap-GAN [6], DeepFaceLab [4], and df [2]. In this work, we focus on Faceswap-GAN.
Most deepfake detectors were built to detect the whole population of deepfake videos, i.e., deepfake videos of whatever identities are targeted. However, victims of deepfakes are most often public figures and their deepfake videos are more detrimental due to their widespread public exposure. In this work, we propose a deepfake image detection system customized for individual subjects. Our theory-driven simulations suggest that identity conditioning on deepfake detection tends to exhibit advantages in more challenging detection tasks. As our experimental results will show, the existing tools for deepfake face detection that encompass the whole population may work suboptimally for a specific public figure. The proposed detector for specific individuals is especially useful for journalism. For example, before reporting news based on an image of a public figure of unknown authenticity, a journalist can apply the proposed detection tool to determine its authenticity.
Our approach to deepfake detection draws inspiration from a series of studies leveraging the near-idempotence property of an operation. This method has been particularly effective in various image forensics tasks, including double JPEG compression detection, unknown video codec identification, and source camera identification [32, 53, 9, 23, 40]. In these studies, researchers leverage the near-idempotence of a respective operation, such as a certain type of JPEG compression, video compression, or color demosaicing algorithm. The strict idempotence property asserts that an idempotent operation, $g$, results in no further change when it is applied again to its own output, i.e., $g(g(x)) = g(x)$ for any input $x$. Using slightly different terminology, if $g(g(x))$ approximately equals $g(x)$, the operation is nearly idempotent. In many detection problems of multimedia forensics, the nearly idempotent nature of a forgery method allows an analyst to reapply the forgery operation and observe the resulting change to determine whether the input has been forged before: an input that has already been forged will exhibit minimal change, whereas a pristine input, forged for the first time, will change noticeably.
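As a toy illustration of near-idempotence, the sketch below uses uniform quantization as a stand-in for a lossy operation such as the quantization step in JPEG compression. The `quantize` function, the step size, and the sample values are assumptions made for illustration only and are not taken from the cited works. The first pass changes a pristine input noticeably, while the second pass changes nothing.

```python
def quantize(values, step=8):
    """Toy lossy operation: snap each value to the nearest multiple of `step`."""
    return [step * round(v / step) for v in values]

def change(before, after):
    """Euclidean distance between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(before, after)) ** 0.5

# Hypothetical "pristine" signal that has never been quantized.
pristine = [3.0, 17.0, 42.0, 99.0, 128.0, 201.0]

once = quantize(pristine)   # first application of the operation
twice = quantize(once)      # second application on the already processed signal

first_pass_change = change(pristine, once)   # clearly non-zero
second_pass_change = change(once, twice)     # zero: quantization is idempotent
```

An analyst who only sees an input of unknown history can apply the operation once and inspect the size of the change: a large change suggests a pristine input, a negligible change suggests the operation had already been applied.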
In this work, we demonstrate that near-idempotence is also applicable to the neural network-based Faceswap-GAN [6]. To explore this, we emulate the potential deepfake operation that an attacker might employ, utilizing publicly available data of a public figure and making assumptions about the neural network architecture. Fig. 1 illustrates the inference pipeline of the proposed detector. We feed a test image into the emulated deepfake generator. The expected change in the image due to this operation is dependent on whether the image has undergone a similar operation before. If the image is a deepfake, the near-idempotence property ensures that the change will be minimal. From the standpoint of the deepfake feature extractor, a deepfake image will exhibit processing traces both before and after the operation, leading to subtle observed changes. Conversely, an authentic image without the deepfake operation lacks any processing traces of the neural network, resulting in a significant observable change. The contributions of this paper are threefold.
• We propose to use the near-idempotence property of neural networks for deepfake face detection, introducing a distinct direction of improvement compared to the state of the art. The idempotence-driven approach can potentially complement existing methods.
• We demonstrate that identity conditioning can significantly improve the deepfake detection performance over the state-of-the-art end-to-end CNN classifiers.
• Our detector can focus on specific individuals. Individualized detectors are better suited for journalism.
2 Related Work
2.1 Generation of Deepfake Faces
Early methods of face-swapping such as Bitouk et al. [10] were limited to using two images of two particular persons with similar poses. The images were first aligned with the help of landmark detection, then cropped, and postprocessed with steps including color correction. Subsequent researchers [19] improved on those with a 3-D facial model from the source video. The next advancement emerged after the proposal of a deep-learning-based face-swapping architecture [1] built upon one shared encoder and two individual decoders. Faceswap-GAN [6] is the GAN improvement over [1], where the performance of the shared encoder and individual decoders further improves as a result of the GAN’s internal interplay between the generator and discriminator. However, the architectures of [1, 6] can swap faces between only the two identities involved in training. Researchers have proposed identity-agnostic architectures that decouple the identity extraction from the attribute extraction [8, 45, 42, 43, 37].
2.2 Protection Against Deepfake
Researchers have been exploring different methods to detect deepfakes. In the first category, the artifacts of synthetic videos are exploited for deepfake detection, such as the absence of eye blinking [38], inconsistency in head pose [54], disparities in color components [36], and inconsistency between the inner face and outer face [25]. In the second category, researchers used either an end-to-end convolutional neural network (CNN) structure [51] or a CNN combined with a recurrent neural network (RNN) [29]. In the third category, researchers exploit processing traces left by the neural networks for deepfake detection. They exploited features such as spatial-domain local convolutional features [28], spectral distortion caused by up-convolutions [26], and upsampling artifacts in the frequency domain [27].
Identity-driven Deepfake Detection. Instead of detecting deepfake videos for the whole population, recent work has also exploited characteristics of a specific person for deepfake detection. Agarwal et al. [7] targeted deepfake videos of a specific individual by capturing speaking patterns. Cozzolino et al. [21] proposed to learn the temporal features of how a specific person moves and talks. Dong et al. [25] calculated the distance between the identity vector computed from the inner face and the expected identity vector drawn from a reference set of identity vectors. In this work, we extract the deepfake traces conditioned on the identity.
2.3 Idempotency as a Multimedia Forensics Tool
In multimedia forensics, one way to detect counterfeiting is to exploit the near-idempotence property, i.e., the minor changes caused by the repetitive application of adversarial operations. It shares the spirit of the law of diminishing returns, a widely used concept in economics [13, 50]. The detection of double JPEG compression, source camera identification, and video codec identification are three exemplary applications of the near-idempotence property. The ratio of stable image blocks has been used by researchers to detect the number of prior JPEG compressions [35, 15]. Huang et al. [32] found that the number of dissimilar JPEG coefficients between two subsequent JPEG compressions decreases monotonically. Bestagini et al. [9] detected unknown video encoding by recompressing a video with each of the candidate codecs. For source camera identification, researchers have leveraged the near-idempotence property of an auto-white balancing method [23] and that of a color demosaicing strategy [40]. In economics, the law of diminishing returns states that additional inputs to a fixed amount of identical inputs increase productivity at a decreasing rate [13]. If the additional inputs are considered repetitive operations, then the law of diminishing returns may be viewed as a form of near-idempotence. In this study, we show that the near-idempotence property of neural networks assists in deepfake image detection.
2.4 Unsupervised Pretraining
Unsupervised pretraining has been proposed for feature extraction in many computer vision tasks. Chen et al. [18] found that larger networks, for example, a larger ResNet, pretrained in an unsupervised manner and then trained with supervision on only a small fraction of the labeled data can outperform fully supervised networks for general computer vision tasks. Newell and Deng [44] showed that pretrained networks are more advantageous in low-data regimes than when labeled data is abundant. Their results suggest that pretrained networks should be tested on diverse downstream tasks. Bulat et al. [14] proposed task-agnostic self-supervised pretraining on in-the-wild facial data for representation learning. Zheng et al. [58] proposed weakly supervised facial representation learning using the vast number of facial images available on the web with linguistic descriptions. In this work, we utilize the facial features from Bulat et al. [14] to additionally learn the deepfake traces.
3 Threat Model
In this work, we consider an attacker who is capable of finding and using open-source face-swapping software such as [6, 5, 1] on the facial images from the publicly available videos of a public figure. More specifically, we consider Faceswap-GAN [6] as a potential method that the attacker can use. The attacker is free to use any public or private videos of a second person to depict a story, and they want to convince the public of the involvement of a targeted public figure. For example, the attacker can record prearranged videos at a professional studio and later replace the actor’s face with that of the public figure. The attacker can harvest videos of the public figure from multiple sources, including social media, news channels, movies, and YouTube. Different sources of videos offer varied image quality, compression levels, and processing histories. For example, public interview videos of a public figure available on YouTube are expected to be less edited than video clips from movies. In our proposed detection method, we assume that we, as forensic analysts, have access to the various sources of public figure videos, but we do not know exactly from what source the attacker took videos for deepfake generation. For example, the attacker may use videos from social media, whereas we use only public interview recordings of that public figure to train the neural-network-based detector.
4 Proposed Detector via Near-Idempotence and Identity Conditioning
In the challenge of identifying deepfake faces for public figures, we confront an image of unknown authenticity, claimed to be a specific public figure. Our approach to addressing this problem makes use of the extensive collection of authentic images or videos of the said public figure from YouTube. The training process of our proposed deepfake detector is depicted in Fig. 2 and the inference pipeline is shown in Fig. 1.
Our proposed detector has four distinct components. First, the reconstruction operator is a neural network operation that simulates the deepfake generation operation for a public figure. We found this operation to be nearly idempotent. Second, the feature extractor is finetuned with a teacher network and is able to capture the identity information while extracting the features. Third, the identity decoder takes as input the explicit identity, i.e., the index of the public figure, and learns a constant identity vector that augments the feature space. This vector contains the necessary person-specific information of that public figure, and when combined with the identity-aware feature, it can effectively yield the deepfake features conditioned on identity. Fourth, the Siamese network serves as the final binary classification block in the proposed architecture. It learns to extract the features linked to the idempotency of the deepfake operation. It produces a larger distance before and after reconstruction for an authentic test image and a smaller distance for a deepfake test image.
4.1 Reconstruction Operator and Idempotence-Driven Detection
We employ a dedicated reconstruction operator for each public figure as shown in Fig. 1 and Fig. 2. When the original image is authentic, the first operation generates a deepfake image, and the second operation produces a doubly processed deepfake. We verified experimentally that the reconstruction operator serves as a reliable approximation of a specific type of deepfake generation tool, such as FaceSwap-GAN [6], and that the deepfake generation process is nearly idempotent. In this context, the distance between a deepfake image and its corresponding doubly processed deepfake tends to be close to zero. This characteristic is leveraged in the training and inference system.
The next consideration is how to obtain the identity-specific reconstruction operator. For each public figure within our scope, we accumulate numerous images of that public figure and train a neural network based on an autoencoder utilizing the encoder and decoder architecture from FaceSwap-GAN [6]. This network learns the facial characteristics of the public figure, and when given a facial image of that public figure, it can reproduce approximately the same image as the output. Since the objective of this network is to replicate the input facial image of an identity, we refer to the resulting operator as the reconstruction operator or emulated deepfake generator. Some examples of reconstructed images are shown in Fig. 3.
The reconstruction operator exhibits near-zero changes to a deepfake image due to the near-idempotence. Consequently, the feature level Euclidean distance between the two is expected to be small. On the other hand, an authentic image and its corresponding processed image will be substantially different as the operation leaves discernible traces in the processed image. Considering the capability of our deepfake feature extractor (see Sec. 4.2) to detect these traces, the features will exhibit significant dissimilarity, resulting in a higher distance compared to the deepfake scenario.
Based on the above considerations, the initial problem of detecting whether an image is authentic or deepfake is now reframed as evaluating the change of the image in the feature space through the reconstruction operation. When this change, quantified as the Euclidean distance, approaches zero, the image is classified as a deepfake; otherwise, it is considered authentic. Denoting the input image by $f$, the reframed problem is to evaluate whether $f$ and $R(f)$ are the same or not, where $R$ is our reconstruction operator. Treating $f$ and $R(f)$ as two inputs, we note that the Siamese network [12] is a powerful approach for discerning similarity or dissimilarity between two inputs. Our use of the Siamese network will be discussed in Sec. 4.4.
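The reframed decision rule can be sketched in a few lines. The example below is a toy under stated assumptions, not the detector itself: the reconstruction operator is replaced by an orthogonal projection (which is exactly idempotent), the feature extractor is the identity map, and the threshold and vectors are hypothetical. In the actual system the distance is measured between learned features and the decision is made by the Siamese network.

```python
import math

def project(x, u):
    """Toy reconstruction operator: orthogonal projection of x onto unit vector u."""
    coeff = sum(a * b for a, b in zip(x, u))
    return [coeff * b for b in u]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def classify(image, reconstruct, threshold):
    """Label an input by how much one more pass through the operator changes it."""
    d = distance(image, reconstruct(image))
    return ("deepfake" if d < threshold else "authentic"), d

U = [0.6, 0.8, 0.0]                        # hypothetical unit direction
reconstruct = lambda x: project(x, U)

authentic = [1.0, 2.0, 3.0]                # never passed through the operator
deepfake = reconstruct([4.0, -1.0, 2.0])   # already passed through the operator once

label_auth, d_auth = classify(authentic, reconstruct, threshold=0.5)
label_fake, d_fake = classify(deepfake, reconstruct, threshold=0.5)
```

The already processed input barely moves under a second pass, while the unprocessed input moves by a large amount, which is the contrast the detector is trained to pick up.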
4.2 Identity-Aware Feature Extractor
Motivation. Conventional deepfake feature extraction networks extract the deepfake features of a test image while ignoring the person's identity [11, 22, 57] or treat the identity features as irrelevant to forgery detection [30, 52]. We found that an identity-aware feature extractor, which extracts identity information in addition to the deepfake features, is more effective for deepfake detection. A possible explanation is that a given extracted feature may not be equally discriminative for every identity. If a feature extractor does not pass the identity information along, the later network cannot learn the statistics of the features individually for each identity and is limited to learning the average pattern. Such average distributions of the features lead to the following error probability of the Bayesian classifier:
$$P_e^{\mathrm{avg}} = \mathbb{P}(\hat{H} = H_1 \mid H_0)\,\mathbb{P}(H_0) + \mathbb{P}(\hat{H} = H_0 \mid H_1)\,\mathbb{P}(H_1), \tag{1}$$
where $\mathbb{P}$ is the probability measure, $H_0$ and $H_1$ are the two hypotheses, and $\hat{H}$ is the predicted class. On the other hand, if the feature extractor passes the identity along, the later network can distinguish the features for each identity separately. Knowing the distributions of the features for each identity separately leads to the error probability:
$$P_e^{\mathrm{id}} = \sum_{i \in \mathcal{I}} \mathbb{P}(i) \left[ \mathbb{P}(\hat{H} = H_1 \mid H_0, i)\,\mathbb{P}(H_0 \mid i) + \mathbb{P}(\hat{H} = H_0 \mid H_1, i)\,\mathbb{P}(H_1 \mid i) \right], \tag{2}$$
where $\mathcal{I}$ is the set of all identities. In Sec. 3 of the supplementary document, we show that the latter identity-conditioning approach is more powerful in reducing classification error. We compared the two approaches through theory-driven simulations, which demonstrate that $P_e^{\mathrm{id}}$ tends to be lower (better) than $P_e^{\mathrm{avg}}$. Furthermore, we observed that the gain of $P_e^{\mathrm{id}}$ over $P_e^{\mathrm{avg}}$ is more significant when the deepfake traces for individuals are more unique and the detection problem is intrinsically more difficult.
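The benefit of identity conditioning can be illustrated with a small numerical experiment. The sketch below is not the simulation from the supplementary document; it assumes a scalar feature that is Gaussian with unit variance, an identity-specific offset for each person, a fixed shift introduced by the deepfake operation, equal priors for the two hypotheses, and equally likely identities. All of these modeling choices, along with the function names and parameter values, are hypothetical. It compares the Bayes error of a classifier that only sees the pooled (identity-agnostic) feature distributions with that of a classifier that knows the identity.

```python
import math

def normal_pdf(x, mean, sd=1.0):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_errors(identity_offsets, trace_shift, lo=-12.0, hi=12.0, steps=4800):
    """Return (pooled_error, conditioned_error) via midpoint-rule integration.

    Authentic features of identity i ~ N(offset_i, 1); deepfake features
    ~ N(offset_i + trace_shift, 1). Priors and identities are uniform.
    """
    n = len(identity_offsets)
    dx = (hi - lo) / steps
    pooled = 0.0
    conditioned = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        auth = [normal_pdf(x, a) / n for a in identity_offsets]
        fake = [normal_pdf(x, a + trace_shift) / n for a in identity_offsets]
        # Identity unknown: decide from the mixture densities.
        pooled += 0.5 * min(sum(auth), sum(fake)) * dx
        # Identity known: decide separately within each identity.
        conditioned += 0.5 * sum(min(p, q) for p, q in zip(auth, fake)) * dx
    return pooled, conditioned

pooled_err, conditioned_err = bayes_errors(
    identity_offsets=[-2.0, 0.0, 2.0], trace_shift=1.0)
```

Because the minimum of two sums is never smaller than the sum of the term-wise minima, the conditioned error can never exceed the pooled error in this model; the gap grows as the identity offsets become large relative to the shift caused by the deepfake operation.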
Training. To make the feature extraction network identity-aware, we use a neural network in which the earlier layers extract identity-aware features along with other features, and the later layers extract deepfake traces. We use a learned facial representation, trained by Bulat et al. [14], as the starting point for training the feature extractor. Their trained network has a ResNet architecture. For extracting deepfake features, we tune the portion of the network after the “conv4” block.
We reused the model and initial weights from Bulat et al. [14] for the following three reasons. First, having an existing network that lets personal identity pass through makes our task easier to additionally learn the deepfake traces. In comparison, training a network simultaneously for personal identity and deepfake detection would require joint training of two downstream tasks, which is harder. Second, a deeper network trained with unlabelled data is less biased to any specific portion of the dataset [18]. Bulat et al. [14] pretrained the ResNet architecture with million facial images. Consequently, the initial layers of the network are anticipated to learn a robust representation of features, including the identity. The network is also tested over multiple downstream tasks and therefore, it is a good candidate for extracting facial features [44]. Third, according to Newell and Deng [44], there is an advantage in unsupervised pretraining with unlabeled data when the labeled finetuning dataset is small, which aligns with our labeled training dataset comprising 295 videos from 59 celebrities.
The training for the backbone network is depicted in Fig. 4. The input is an image pair, consisting of an authentic image and its corresponding deepfake, generated using a deepfake generation tool. The input is passed through a student network and a teacher network in parallel. The student network is composed of the pretrained facial representation learning backbone [14] and a concatenated task adaptation head for learning specifically the deepfake traces. The layers after the “conv4” block of the pretrained backbone and the task adaptation head are the tunable portions of the student network. We then utilize the EfficientNetAutoAttB4ST [11] as the teacher network to distill the knowledge for learning the deepfake traces. To adapt the deepfake traces based on personal identity, we add a loss function that contrasts the learned traces of a deepfake and its corresponding authentic image in addition to the knowledge distillation losses and . Given the authentic facial image of identity , , and its corresponding deepfake image , the loss terms are defined as follows:
(3a) | ||||
(3b) | ||||
(3c) |
where is the Euclidean distance and is the margin of the hinge loss. The three loss terms are combined as , with hyperparameters and . and contribute to the knowledge distillation for learning the deepfake traces and contributes to learning the deepfake traces according to identity.
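To make the structure of the three-term objective concrete, the sketch below shows one plausible instantiation. It is an assumption-laden illustration, not the paper's exact formulation: the two distillation terms are taken to be Euclidean distances between student and teacher embeddings for the authentic image and for the deepfake, the third term is a hinge on the distance between the student's embeddings of the authentic/deepfake pair, and the weights `w_fake` and `w_contrast` are hypothetical stand-ins for the two hyperparameters.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def student_loss(s_real, s_fake, t_real, t_fake, margin, w_fake, w_contrast):
    """Assumed form: two distillation terms plus a hinge term on the real/fake pair."""
    kd_real = euclidean(s_real, t_real)   # student vs. teacher on the authentic image
    kd_fake = euclidean(s_fake, t_fake)   # student vs. teacher on the deepfake
    # Hinge: penalize the pair only while it is closer than `margin`.
    contrast = max(0.0, margin - euclidean(s_real, s_fake))
    total = kd_real + w_fake * kd_fake + w_contrast * contrast
    return total, (kd_real, kd_fake, contrast)

# Hypothetical 2-D embeddings for one authentic image and its deepfake.
total, (kd_real, kd_fake, contrast) = student_loss(
    s_real=[1.0, 0.0], s_fake=[0.0, 1.0],
    t_real=[1.0, 0.0], t_fake=[0.0, 2.0],
    margin=2.0, w_fake=1.0, w_contrast=0.5)
```

In this toy case the student already matches the teacher on the authentic image, is one unit away on the deepfake, and the pair is still closer than the margin, so both the second and third terms contribute to the total.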
4.3 Identity Decoder for Feature Conditioning
Our identity decoder is a single-layer fully-connected neural network that maps the one-hot-encoded index of a public figure to the feature space generated by our feature extractor. We combine the output of the identity decoder with the output of the identity-aware feature extractor that contains the joint information of the deepfake feature and identity. The extra marginal information provided by the identity decoder can have the effect of conditioning the identity-aware feature, in a similar spirit as in the Bayes rule.
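A single fully-connected layer applied to a one-hot index simply selects one learned vector per identity. The sketch below illustrates this, with hypothetical weights and dimensions; the combination rule (concatenation) is an assumption, since other choices such as addition would also fit the description.

```python
def one_hot(index, num_identities):
    """One-hot encoding of a public figure's index."""
    return [1.0 if k == index else 0.0 for k in range(num_identities)]

def identity_decoder(one_hot_vec, weights, bias):
    """Single fully-connected layer; weights[k] is the row for identity k."""
    dim = len(bias)
    return [sum(h * row[j] for h, row in zip(one_hot_vec, weights)) + bias[j]
            for j in range(dim)]

def condition(features, identity_vec):
    """Assumed combination rule: concatenate features with the identity vector."""
    return list(features) + list(identity_vec)

# Hypothetical parameters: 3 identities, 2-dimensional identity vectors.
WEIGHTS = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
BIAS = [0.0, 1.0]

id_vec = identity_decoder(one_hot(1, 3), WEIGHTS, BIAS)   # row 1 of WEIGHTS plus BIAS
conditioned = condition([0.7, -0.2, 0.9], id_vec)
```

Since the input is one-hot, the decoder output for a given person does not depend on the test image; it acts as a learned constant that the downstream network can use to interpret the image-dependent features.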
4.4 Contrastive Learning
The Siamese network contains two identical subnetworks that process the two inputs in parallel. The subnetworks learn a manifold for each of the inputs using a contrastive loss that allows a powerful discrimination between the two inputs. In our work, we designed each of the subnetworks as a single-layer neural network that takes as input the features of the corresponding image and outputs a fixed-length vector. We experimentally verified that the chosen length is enough for discriminating the two cases. Let us call the two subnetworks of the Siamese network $S_1$ and $S_2$, where the first one processes the features of $f$ and the second one processes the features of $R(f)$, with $R$ denoting the reconstruction operator. We used the contrastive loss [31] to train the Siamese network as follows:
$$\mathcal{L}(d, y) = (1 - y)\,\tfrac{1}{2}\,d^2 + y\,\tfrac{1}{2}\,\max(0,\, m - d)^2, \tag{4}$$
where $d$ is the Euclidean distance between the processed manifolds, i.e., $d = \lVert S_1(z_1) - S_2(z_2) \rVert_2$, $z_1$ is the identity-conditioned features of $f$, $z_2$ is the identity-conditioned features of $R(f)$, $m$ is a margin, and $y$ is the known binary label of $f$, i.e., $y$ is $1$ if $f$ is authentic, and $0$ otherwise. We learned the weights of the identity decoder and the two subnetworks of the Siamese network using this loss function. Additionally, in contrast to the standard Siamese network, we decoupled the weights of the two subnetworks, $S_1$ and $S_2$, similar to CLIP [46], which improved performance.
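A minimal sketch of this training signal follows, assuming the standard margin-based contrastive loss and the label convention that authentic inputs form the "dissimilar" class (large distance desired) while deepfakes form the "similar" class (small distance desired). The two single-layer subnetworks are modeled as separate weight matrices to reflect the decoupling; all names, weights, and feature values are hypothetical.

```python
import math

def linear(weights, x):
    """Single-layer subnetwork without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def pair_distance(feat_in, feat_rec, w_in, w_rec):
    """Distance between the embeddings of the input and of its reconstruction."""
    a = linear(w_in, feat_in)     # subnetwork for the features of the input image
    b = linear(w_rec, feat_rec)   # separate subnetwork for the reconstructed image
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def contrastive_loss(d, is_authentic, margin):
    """Margin-based contrastive loss under the assumed label convention."""
    if is_authentic:
        return 0.5 * max(0.0, margin - d) ** 2   # push authentic pairs apart
    return 0.5 * d ** 2                          # pull deepfake pairs together

# Decoupled weights: two independent matrices (identity maps here for clarity).
W_IN = [[1.0, 0.0], [0.0, 1.0]]
W_REC = [[1.0, 0.0], [0.0, 1.0]]

d_fake = pair_distance([0.5, 0.5], [0.5, 0.7], W_IN, W_REC)   # small change
d_auth = pair_distance([0.5, 0.5], [2.0, 0.5], W_IN, W_REC)   # large change

loss_fake = contrastive_loss(d_fake, is_authentic=False, margin=1.0)
loss_auth_far = contrastive_loss(d_auth, is_authentic=True, margin=1.0)
loss_auth_near = contrastive_loss(d_fake, is_authentic=True, margin=1.0)
```

An authentic pair that is already farther apart than the margin incurs no loss, whereas an authentic pair that sits as close together as a deepfake pair is penalized, which is what drives the two cases apart during training.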
5 Experimental Results
5.1 Dataset Curation
Deepfake detection methods generally perform well in the in-dataset evaluation, but the performance drops significantly in cross-dataset evaluation [17]. To provide reliable measurements for the performance of the deepfake detection method, we report only the cross-dataset evaluation results. As our deepfake detector conditions on the identity, we need two independent datasets containing facial images of the same set of identities. Most of the public deepfake detection datasets such as DFDC [24], DFD [3], Deeper Forensics [33] do not explicitly mention identity information associated with the videos. It is hard to find the same persons in another dataset, which would be necessary to perform the cross-dataset evaluation of individualized deepfake detection. To the best of our knowledge, our work is the first work that evaluates the deepfake detection performance on another dataset for each person separately. We curate a dataset from Celeb-DF [39] and CACD [16]. Our curated dataset contains facial images of public figures sourced from YouTube videos for the train set and from the cross-age facial image dataset of the same public figures for the test dataset. We plan to release the source code and dataset for public use after publication.
For the training dataset, we use real videos from the Celeb-DF dataset [39], which is a popular deepfake detection dataset of 59 public figures. We sample frames from the videos at frames per second (fps), and detect faces from the videos using the MTCNN [56] face detection network. For each individual , we have facial images from the th authentic Celeb-DF video, where , , and is the number of the frames extracted from the video.
Examining multiple candidate datasets, we narrowed down to the CACD [16] dataset for cross-dataset evaluation. CACD [16] contains cross-age facial images of 2,000 public figures with an overlap of 45 public figures with the Celeb-DF dataset. From the CACD dataset, we have authentic images for the th identity and th available age group of that identity, where , , and is the number of the images available for that age group. To generate deepfake faces for the test set, for each person , we choose another identity from the database of 2,000 persons and then train a Faceswap-GAN [6] model using the facial images and .
5.2 Experimental Setup
As our proposed deepfake detector is designed to operate only when the identity is provided, we conduct our evaluation on our curated dataset comprising two distinct subsets of facial images with an overlap of public figures. We use the cropped facial images from the video frames of Celeb-DF [39] as the training set and the images from CACD [16] as the test set.
Our proposed method had two stages of training. In the first stage, we trained the identity-aware feature extractor. For this training, we resized the facial images to 224-by-224 and used random cropping and random horizontal flipping for image augmentation. As shown in Fig. 4, we used a pair of images for the backbone training. We enforced identical cropping within the same pair consisting of an authentic image and its deepfake. We used SGD optimizer and a learning rate of . The contrastive loss margin was and the values of and were varied manually within . We set to during the initial 1,500 epochs of training and subsequently modified it to for the next 1,500 epochs. In the second stage, we trained the Siamese network and the identity decoder. For this training, we used Adam optimizer, and the contrastive loss margin was and the learning rate was determined by the grid search within the range of .
For face reconstructor training, we separated the facial images from the last five videos of the Celeb-DF dataset. For the final classification network training, we randomly selected facial images from one video as the validation set and facial images from other four videos as the training set. We repeated this process four times to ensure the results would be statistically stable. As for the test set, we used all of the real and face-swapped images that we generated from CACD. In each training session, the neural network with the smallest validation loss was chosen as the final network for the test set.
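The repeated validation procedure can be sketched as follows, under the reading that five videos per identity are set aside and that, in each of four repetitions, one of them is drawn at random for validation while the other four are used for training. The video identifiers, the seed, and the helper name are hypothetical.

```python
import random

def make_splits(video_ids, repeats=4, seed=0):
    """Draw one validation video per repetition; the rest form the training set."""
    rng = random.Random(seed)   # fixed seed so the splits are reproducible
    splits = []
    for _ in range(repeats):
        val = rng.choice(video_ids)
        train = [v for v in video_ids if v != val]
        splits.append({"val": val, "train": train})
    return splits

# Hypothetical identifiers for the five set-aside videos of one identity.
held_out_videos = ["video_a", "video_b", "video_c", "video_d", "video_e"]
splits = make_splits(held_out_videos)
```

For each split, the model with the smallest validation loss would then be kept and evaluated on the separate test set.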
5.3 Performance Gain
In this subsection, we compare the performance of our proposed method with two state-of-the-art methods. The first baseline considered is the Xception [20] network trained on the FaceForensics++ dataset [48] with deepfake videos generated by four methods including Faceswap [5]. The second one is the EfficientNetAutoAttB4ST [11] network trained on the DFDC dataset [24], a dataset consisting of deepfake videos generated by various popular face-swapping methods, such as Faceswap-GAN [6], StyleGAN [34], Faceswap [5], and NTH [55]. The performance of the two baseline approaches on our test dataset is presented in Tab. 1.
Method | AUC Mean (SD) | AUC Median (IQR) | AUC 10%-trimmed Mean |
---|---|---|---|
Xception [20] | 0.792 (0.11) | 0.799 (0.14) | 0.799 |
Xception [20] (tuned) | 0.887 (0.07) | 0.896 (0.09) | 0.894 |
EfficientNet [11] | 0.728 (0.13) | 0.733 (0.16) | 0.732 |
EfficientNet [11] (tuned) | 0.920 (0.06) | 0.926 (0.07) | 0.927 |
Proposed | 0.940 (0.05) | 0.958 (0.05) | 0.947 |
To ensure a fair comparison with our proposed method, we fine-tuned these two baseline methods using our training dataset. This involved keeping the features frozen and training a classification layer on top of the features until the performance saturated on the validation dataset. After fine-tuning, EfficientNetAutoAttB4ST [11] had an AUC mean of 0.920 across identities with a sample standard deviation of 0.06.
To evaluate the idea of utilizing idempotency and identity conditioning, we applied the double neural network operation and obtained the features from our trained identity-aware feature extractor. We concatenated those with the features of EfficientNetAutoAttB4ST [11]. Tab. 1 shows that the proposed method achieves an AUC mean of 0.940 across identities, an increase of 0.020 over Bonettini et al. [11]. The AUC median across identities was 0.958, a gain of 0.032 over the baseline [11]. The 10%-trimmed mean was 0.947, a gain of 0.020. Compared to the baseline [11], the AUC standard deviation was reduced from 0.06 to 0.05, or 17%, and the AUC interquartile range was reduced from 0.07 to 0.05. This result demonstrates that idempotency and identity conditioning can improve detection performance in terms of both accuracy and variability across identities. The detection results on the test dataset for six of the public figures are shown in Fig. 5.
For the proposed method, the averaged AUC value among all public figures is 0.940 and the sample standard deviation is 0.05. We also performed $t$-tests, which showed that the proposed method is significantly better than the off-the-shelf detectors in terms of AUC. The larger variance of the AUC values of the baseline methods implies that such a deepfake detector may perform convincingly for one identity, but it has a greater risk of exhibiting unacceptable performance for others. This makes the baseline methods less attractive for journalists.
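The per-identity summary statistics reported in the tables (mean, sample standard deviation, median, interquartile range, and 10%-trimmed mean) can be computed as sketched below. The AUC values are made up for illustration and are not results from the paper; the quartile convention (linear interpolation, inclusive) and the trimming rule (drop the lowest and highest 10% of values) are assumptions, since several conventions exist.

```python
import statistics

def summarize(aucs, trim=0.10):
    """Summary statistics of per-identity AUC values."""
    data = sorted(aucs)
    k = int(len(data) * trim)              # number of values cut from each end
    trimmed = data[k:len(data) - k]
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),      # sample standard deviation
        "median": statistics.median(data),
        "iqr": q3 - q1,
        "trimmed_mean": statistics.mean(trimmed),
    }

# Hypothetical per-identity AUC values for ten public figures.
example_aucs = [0.80, 0.85, 0.88, 0.90, 0.92, 0.94, 0.95, 0.96, 0.97, 0.99]
summary = summarize(example_aucs)
```

Reporting the median, interquartile range, and trimmed mean alongside the mean and standard deviation makes the comparison less sensitive to a few identities with unusually low or high AUC.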
5.4 Ablation Studies
Tab. 2 displays the results of ablation studies. In the first ablation study, we applied our idempotence strategy (with the identity decoder) using the EfficientNetAutoAttB4ST features. In the second study, we concatenated the features from the identity-aware feature extractor with the features of EfficientNetAutoAttB4ST, as we did in our proposed method, and used a feedforward network to classify the images. The first ablation achieved an AUC mean of 0.926 and an AUC median of 0.928, with a sample standard deviation of 0.05 and an interquartile range of 0.06. The second ablation achieved an AUC mean of 0.893 and an AUC median of 0.920, with a sample standard deviation of 0.10 and an interquartile range of 0.13. Both sets of AUC values are lower than those of the proposed method (mean 0.940, median 0.958). This indicates that the identity conditioning and the idempotence strategy have synergy (positive interaction).
Method | AUC Mean (SD) | AUC Median (IQR) | AUC 10%-Trimmed Mean |
---|---|---|---|
Proposed | 0.940 (0.05) | 0.958 (0.05) | 0.947 |
Idempotence | 0.926 (0.05) | 0.928 (0.06) | 0.932 |
Identity-aware features | 0.893 (0.10) | 0.920 (0.13) | 0.904 |
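The feature-level fusion used in the second ablation and in the proposed method (concatenating identity-aware features with baseline features and classifying with a feedforward network) can be sketched as follows. This is only a shape-level illustration: the feature dimensions, the single hidden layer, and the random untrained weights are all assumptions, and the random arrays merely stand in for the real extractor outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def feedforward_score(features, w1, b1, w2, b2):
    """One-hidden-layer network: ReLU hidden layer, sigmoid output in (0, 1)."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    logit = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical sizes; the real extractors' dimensions are not given in the text.
batch, d_identity, d_baseline, d_hidden = 4, 16, 32, 8
identity_feats = rng.normal(size=(batch, d_identity))  # stand-in for identity-aware features
baseline_feats = rng.normal(size=(batch, d_baseline))  # stand-in for EfficientNetAutoAttB4ST features

# Concatenate along the feature axis so each image has one joint descriptor.
joint = np.concatenate([identity_feats, baseline_feats], axis=1)

# Untrained random weights, used only to show the shapes involved.
w1 = rng.normal(scale=0.1, size=(d_identity + d_baseline, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=0.1, size=(d_hidden,))
b2 = 0.0

scores = feedforward_score(joint, w1, b1, w2, b2)  # one score per image
print(joint.shape, scores.shape)
```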
6 Discussion
In the current work, the reconstruction model is the deepfake generation tool itself, i.e., Faceswap-GAN, one of the most popular and effective off-the-shelf tools for deepfake generation. When the reconstruction model does not match the tool used for deepfake generation, e.g., when the reconstruction model is a Faceswap-GAN model whereas the deepfake video was generated by a Faceswap model, the processing traces left by deepfake generation and by reconstruction may differ, although both are produced by neural networks. A more sophisticated classifier may be needed to exploit the processing traces left by different network structures. Under the proposed double-operations framework, a more general Siamese neural network for processing-trace manifold learning may work, but we leave this exploration to future work.
Compared to end-to-end CNN-based classifiers, our proposed method targets deepfake detection for specific individuals, with public figures as the main application. Although our method requires training reconstruction models, the training can be done in advance for each public figure. For example, a journalist can train the reconstruction models for various candidates before needing to verify videos for reporting tasks. Journalists may also share or collaboratively train detectors within their professional networks. To support a new individual, the journalist needs to train a reconstruction operator for that individual and then fine-tune the Siamese network.
7 Conclusion and Future Work
In this work, we have proposed double neural-network operations and identity conditioning for deepfake detection. The proposed detector achieves better detection performance than end-to-end CNN-based detectors on our curated dataset of public figures with identity labels. We have found that utilizing identity information makes the deepfake detector more reliable. In future work, we plan to extend the double-operations detection to scenarios with mismatched neural network architectures.
References
- [1] Deepfakes. https://github.com/deepfakes/faceswap Accessed on: June, 2023.
- [2] Deepfake. https://github.com/dfaker/df Accessed on: June, 2023.
- [3] Deep fake detection dataset. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html Accessed on: June, 2023.
- [4] Deepfacelab. https://github.com/iperov/DeepFaceLab/ Accessed on: June, 2023.
- [5] FaceSwap. https://faceswap.dev Accessed on: June, 2023.
- [6] FaceSwap-GAN. https://github.com/shaoanlu/faceswap-GAN Accessed on: June, 2023.
- [7] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting world leaders against deep fakes. In IEEE/CVF Conf. Comput. Vision Pattern Recog. Workshops, Long Beach, CA, 2019.
- [8] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Towards open-set identity preserving face synthesis. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 6713–6722, 2018.
- [9] Paolo Bestagini, Ahmed Allam, Simone Milani, Marco Tagliasacchi, and Stefano Tubaro. Video codec identification. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 2257–2260, 2012.
- [10] Dmitri Bitouk, Neeraj Kumar, Samreen Dhillon, Peter Belhumeur, and Shree K Nayar. Face swapping: Automatically replacing faces in photographs. In ACM SIGGRAPH, pages 1–8, 2008.
- [11] Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, and Stefano Tubaro. Video face manipulation detection through ensemble of CNNs. In IEEE Int. Conf. Learn. Pattern, 2020.
- [12] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. Adv. Neural Infor. Process. Syst., 6, 1993.
- [13] Stanley L Brue. Retrospectives: The law of diminishing returns. Journal of Economic Perspectives, 7(3):185–192, 1993.
- [14] Adrian Bulat, Shiyang Cheng, Jing Yang, Andrew Garbett, Enrique Sanchez, and Georgios Tzimiropoulos. Pre-training strategies and datasets for facial representation learning. In European Conference on Computer Vision, pages 107–125. Springer, 2022.
- [15] Matthias Carnein, Pascal Schöttle, and Rainer Böhme. Forensics of high-quality JPEG images with color subsampling. In IEEE Int. Workshop Informat. Forensics Security, pages 1–6, 2015.
- [16] Bor-Chun Chen, Chu-Song Chen, and Winston H. Hsu. Cross-age reference coding for age-invariant face recognition and retrieval. In European Conf. Comput. Vision, 2014.
- [17] Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 18710–18719, 2022.
- [18] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Adv. Neural Infor. Process. Syst., 33:22243–22255, 2020.
- [19] Yi-Ting Cheng, Virginia Tzeng, Yu Liang, Chuan-Chang Wang, Bing-Yu Chen, Yung-Yu Chuang, and Ming Ouhyoung. 3d-model-based face replacement in video. In SIGGRAPH, 2009.
- [20] François Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 1251–1258, 2017.
- [21] Davide Cozzolino, Andreas Rössler, Justus Thies, Matthias Nießner, and Luisa Verdoliva. Id-reveal: Identity-aware deepfake video detection. In IEEE/CVF Int. Conf. Comput. Vision, pages 15108–15117, 2021.
- [22] Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K Jain. On the detection of digital face manipulation. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 5781–5790, 2020.
- [23] Zhonghai Deng, Arjan Gijsenij, and Jingyuan Zhang. Source camera identification using auto-white balance approximation. In IEEE/CVF Int. Conf. Comput. Vision, pages 57–64, 2011.
- [24] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397, 2020.
- [25] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Ting Zhang, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, and Baining Guo. Protecting celebrities from deepfake with identity consistency transformer. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 9468–9478, 2022.
- [26] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 7890–7899, 2020.
- [27] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In Int. Conf. Mach. Learn., 2020.
- [28] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Deepfake detection by analyzing convolutional traces. In IEEE/CVF Conf. Comput. Vision Pattern Recog. Workshops, pages 666–667, 2020.
- [29] David Güera and Edward J Delp. Deepfake video detection using recurrent neural networks. In IEEE Int. Conf. Advanced Video Signal Based Surveillance, Auckland, New Zealand, 2018.
- [30] Ying Guo, Cheng Zhen, and Pengfei Yan. Controllable guide-space for generalizable face forgery detection. In IEEE/CVF Int. Conf. Comput. Vision, pages 20818–20827, 2023.
- [31] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 1735–1742, New York, NY, 2006.
- [32] Fangjun Huang, Jiwu Huang, and Yun Qing Shi. Detecting double JPEG compression with the same quantization matrix. IEEE Trans. Inf. Forensics Security, 5(4):848–856, 2010.
- [33] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 2889–2898, 2020.
- [34] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 4401–4410, 2019.
- [35] ShiYue Lai and Rainer Böhme. Block convergence in repeated transform coding: JPEG-100 forensics, carbon dating, and tamper detection. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 3028–3032, 2013.
- [36] Haodong Li, Bin Li, Shunquan Tan, and Jiwu Huang. Detection of deep network generated images using disparities in color components. arXiv preprint arXiv:1808.07276, 2018.
- [37] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019.
- [38] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI created fake videos by detecting eye blinking. In IEEE Int. Workshop Informat. Forensics Security, Hong Kong, 2018.
- [39] Yuezun Li, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 3207–3216, 2020.
- [40] Simone Milani, Paolo Bestagini, Marco Tagliasacchi, and Stefano Tubaro. Demosaicing strategy identification via eigenalgorithms. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 2659–2663, 2014.
- [41] Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey. ACM Comput. Surveys, 54(1):1–41, 2021.
- [42] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. FsNet: An identity-aware generative model for image-based face swapping. In Asian Conf. Comput. Vision, Perth, Australia, Dec. 2–6, 2018.
- [43] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. RSGAN: Face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447, 2018.
- [44] Alejandro Newell and Jia Deng. How useful is self-supervised pretraining for visual tasks? In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 7345–7354, 2020.
- [45] Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject agnostic face swapping and reenactment. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 7184–7193, 2019.
- [46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [47] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 10684–10695, 2022.
- [48] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In IEEE/CVF Int. Conf. Comput. Vision, pages 1–11, 2019.
- [49] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Infor. Process. Syst., 35:36479–36494, 2022.
- [50] WJ Spillman. Application of the law of diminishing returns to some fertilizer and feed data. Journal of Farm Economics, 5(1):36–52, 1923.
- [51] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN-generated images are surprisingly easy to spot… for now. In IEEE/CVF Conf. Comput. Vision Pattern Recog., 2020.
- [52] Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. UCF: Uncovering common features for generalizable deepfake detection. In IEEE/CVF Int. Conf. Comput. Vision, pages 22412–22423, 2023.
- [53] Jianquan Yang, Jin Xie, Guopu Zhu, Sam Kwong, and Yun-Qing Shi. An effective method for detecting double JPEG compression with the same quantization matrix. IEEE Trans. Inf. Forensics Security, 9(11):1933–1942, 2014.
- [54] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 8261–8265, Brighton, UK, 2019.
- [55] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In IEEE/CVF Int. Conf. Comput. Vision, pages 9459–9468, 2019.
- [56] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Letters, 23(10):1499–1503, 2016.
- [57] Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Tianyi Wei, Weiming Zhang, and Nenghai Yu. Multi-attentional deepfake detection. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 2185–2194, 2021.
- [58] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 18697–18709, 2022.
Supplementary Material
1 Tool for Deepfake Generation
High-quality face-swapped videos may be generated by tools based on convolutional autoencoder models. An autoencoder consists of two neural networks, an encoder and a decoder. The encoder maps the original input to a lower-dimensional representation, i.e., the dimension of the embedding space is smaller than that of the input. The decoder reconstructs the input from the lower-dimensional representation. Given a loss function and training data, an autoencoder is trained by minimizing the loss between each input and its reconstruction. The autoencoder-based deepfake generation tool consists of a shared encoder and two decoders, as shown in Fig. 6. Faceswap-GAN \citeS{fsgS} is one of the most popular and effective publicly available autoencoder-based tools and can generate high-quality deepfake videos \citeS{dolhansky2020deepfakeS}. In this work, we train Faceswap-GAN models to reconstruct videos and generate deepfake videos. The output videos of Faceswap-GAN contain processing traces left by the neural network.
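The shared-encoder, two-decoder structure described above can be illustrated with a toy linear model. The sketch below is not Faceswap-GAN: the linear maps, the SVD-based encoder, the ridge-regression decoders, the mean-squared-error loss, and all dimensions are assumptions chosen only to show how reconstruction (encode and decode with the same identity's decoder) differs from swapping (encode one identity, decode with the other identity's decoder).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, num = 20, 6, 3, 200  # input dim, embedding dim (m < n), per-identity latent dim, samples

def make_faces(basis):
    """Toy 'face' vectors for one identity: low-rank structure plus small noise."""
    return rng.normal(size=(num, k)) @ basis + 0.01 * rng.normal(size=(num, n))

faces_a = make_faces(rng.normal(size=(k, n)))
faces_b = make_faces(rng.normal(size=(k, n)))

# Shared encoder: projection onto the top-m right singular vectors of the pooled data.
pooled = np.vstack([faces_a, faces_b])
_, _, vt = np.linalg.svd(pooled, full_matrices=False)
encoder = vt[:m].T  # shape (n, m)

def fit_decoder(faces, ridge=1.0):
    """Ridge-regression linear decoder mapping embeddings back to one identity's faces."""
    codes = faces @ encoder
    gram = codes.T @ codes + ridge * np.eye(m)
    return np.linalg.solve(gram, codes.T @ faces)  # shape (m, n)

decoder_a, decoder_b = fit_decoder(faces_a), fit_decoder(faces_b)

def mse(x, y):
    """Mean-squared-error loss between inputs and outputs."""
    return float(np.mean((x - y) ** 2))

recon_a = faces_a @ encoder @ decoder_a  # reconstruction: encode A, decode with A's decoder
swapped = faces_a @ encoder @ decoder_b  # swap: encode A, decode with B's decoder
loss_recon = mse(faces_a, recon_a)
loss_swap = mse(faces_a, swapped)
print("reconstruction loss:", loss_recon)
print("swap vs. original  :", loss_swap)
```

In this toy setting the reconstruction loss stays near the noise level, while the swapped output departs from the input because the second decoder was fitted only to the other identity.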
2 Deepfake Generation and Reconstruction Model
For each target video, we generated a deepfake video by feeding a Faceswap-GAN model with the target video and a video of a public figure with a known identity. The video of the public figure was collected from YouTube, and the public figure has the same gender as the person in the template video. For deepfake generation, we swapped the face of the public figure onto the face of the person in the target video, since public figures are usually the victims of deepfakes.
For deepfake detection using double operations, a journalist does not know which videos were used to generate the potentially fake video and knows only the identity of the person in the video in question. Therefore, the reconstruction model was trained using videos of the known public figure from scenes other than those used for deepfake generation. The reconstructing unit is chosen to share the same neural-network structure as the deepfake-generating unit, i.e., a Faceswap-GAN. For each public figure, we trained a Faceswap-GAN using five videos, with a sufficient number of iterations to ensure good reconstruction quality.
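The intuition behind double operations can be shown with an idealized stand-in for the reconstruction network. In the sketch below the network is replaced by an orthogonal projection, which is exactly idempotent; a real Faceswap-GAN is at best approximately so, and the dimensions and function names are hypothetical. An input that has already passed through the network changes very little on a second pass, whereas an unprocessed input changes noticeably.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 64, 8  # hypothetical image and subspace dimensions

# Orthonormal basis of an m-dimensional subspace; q @ q.T is an exact projection.
q, _ = np.linalg.qr(rng.normal(size=(n, m)))

def network(x):
    """Idealized generator/reconstructor: orthogonal projection, hence idempotent."""
    return q @ (q.T @ x)

def residual(x):
    """Norm of the change caused by one more pass through the network."""
    return float(np.linalg.norm(network(x) - x))

authentic = rng.normal(size=n)  # never processed by the network
once = network(authentic)       # single operation (simulated deepfake output)
twice = network(once)           # double operation

print("residual of authentic input :", residual(authentic))
print("residual of processed input :", residual(once))
print("max |twice - once|          :", float(np.max(np.abs(twice - once))))
```

A detector built on this idea compares the input with its reconstruction; the size and structure of the difference carry the evidence about whether the network has been applied before.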
3 Advantage of Identity-Conditioned Feature Extraction
Let us consider a set of images containing authentic and deepfake images, where each image is associated with an identity. The set may be decomposed into disjoint subsets in three ways, as follows:
(5)
where the subsets are the set of all images belonging to each individual, the sets of all authentic and deepfake images, and the acceptance region and rejection region partitioned by a decision rule \citeS{van2004detection}.
Let us define a powerful manifold-learning feature extractor for deepfake trace extraction such that the extracted 1-D features of real images and fake images exhibit different distributions. To facilitate our theoretical analysis and simulation, we consider the following hypotheses concerning an observation for a given individual:
(6a)
(6b)
where the two parameters of the model have Gaussian priors, namely,
(7a)
(7b)
where one constant is fixed without loss of generality and the priors share a single variance. Fig. 7(a) illustrates the probability density functions (PDFs) of the observation under the two hypotheses for five individuals. When identity information is unknown, the PDFs under each hypothesis merge into one, as shown in Fig. 7(b).
The Bayes risk \citeS{van2004detection} for an arbitrary rejection region is defined as
(8)
in terms of the probability measure, the cost incurred by choosing one hypothesis when another is true, and the prior of each hypothesis. To focus on the effect of identity conditioning, we assume that the dataset is balanced, i.e., the two hypotheses have equal priors, and that the incurred costs are the same. With these assumptions, the Bayes risk reduces to the overall error probability.
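The reduction from Bayes risk to error probability can be written out explicitly. The following is a sketch in generic notation, which is an assumption and may differ from the symbols used in (8): $\pi_k$ denotes the prior of hypothesis $H_k$, $C_{jk}$ the cost of choosing $H_j$ when $H_k$ is true, $y$ the observation, and $\mathcal{R}$ the rejection region in which $H_1$ is chosen. It additionally assumes zero cost for correct decisions and unit cost for errors.

```latex
\begin{align*}
r(\mathcal{R})
  &= \sum_{j=0}^{1}\sum_{k=0}^{1} C_{jk}\,\pi_k\,P(\text{choose } H_j \mid H_k) \\
  &= C_{10}\,\pi_0\,P(y \in \mathcal{R} \mid H_0)
   + C_{01}\,\pi_1\,P(y \notin \mathcal{R} \mid H_1)
   && (C_{00} = C_{11} = 0) \\
  &= \tfrac{1}{2}\,P(y \in \mathcal{R} \mid H_0)
   + \tfrac{1}{2}\,P(y \notin \mathcal{R} \mid H_1)
   && (\pi_0 = \pi_1 = \tfrac{1}{2},\ C_{01} = C_{10} = 1) \\
  &= P_e .
\end{align*}
```

Under these assumptions, minimizing the Bayes risk over the rejection region is the same as minimizing the overall probability of a wrong decision.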
We further segment the acceptance region and the rejection region by individuals:
(9a)
(9b)
(9c)
(9d)
A standard hypothesis testing technique \citeS{van2004detection} allows us to derive from (9d) the optimal decision rule that minimizes the Bayes risk, or equivalently the error probability. The resulting decision rule turns out to be separable for each individual and takes the form of a likelihood ratio test, namely,
(10)
in which the observation is compared against the optimal decision threshold.
Using the optimal decision rule, one can calculate the minimal error probability following (9c):
(11a)
(11b)
(11c)
(11d)
(11e)
Here, (11c) is due to the assumption that the identities are uniformly distributed over the dataset, and the result is expressed using the cumulative distribution function (CDF) of the standard Gaussian.
In contrast, when there is no information about the identity, the hypothesis testing problem reduces to the basic form shown in Fig. 7(b). One can prove the following identity-agnostic optimal decision rule:
(12)
The minimal error probability with all identities mixed is then given by:
(13a)
(13b)
Using a second-order Taylor expansion, we obtain
(14)
which involves a second-order derivative term. This reveals that the identity-agnostic minimal error probability is larger (worse) than the identity-conditioned one, highlighting the significance of identity conditioning for detection.
Fig. 7(c) shows the two minimal error probabilities obtained from a large number of simulation iterations. The performance improves when the individual distributions are used by the detector, and the effect is amplified when the individualized deepfake traces are more unique across individuals and when the detection problem is intrinsically more difficult, consistent with (14). We verified via simulation that the performance is not sensitive to the number of identities used.
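A comparison of this kind can be reproduced in spirit with a short simulation. The sketch below is not the authors' simulation code and rests on an assumed model: for each identity, the feature is Gaussian with unit variance, its mean under the authentic and deepfake hypotheses is drawn independently from Gaussian priors centered at -d/2 and +d/2 with standard deviation `prior_sd`, and the two hypotheses are equally likely. The names `d`, `prior_sd`, and `num_ids` are hypothetical.

```python
import math
import numpy as np

def phi(x):
    """CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def error_probabilities(d, prior_sd, num_ids=1000, seed=0):
    """Average error with and without identity information under the assumed model."""
    rng = np.random.default_rng(seed)
    mu0 = rng.normal(-d / 2.0, prior_sd, size=num_ids)  # per-identity mean, authentic
    mu1 = rng.normal(+d / 2.0, prior_sd, size=num_ids)  # per-identity mean, deepfake
    # Identity-aware: likelihood ratio test with a per-identity threshold.
    p_aware = np.mean([phi(-abs(b - a) / 2.0) for a, b in zip(mu0, mu1)])
    # Identity-agnostic: one shared threshold at 0 (assumed; optimal for the symmetric population mixture).
    p_agnostic = np.mean([0.5 * (phi(a) + phi(-b)) for a, b in zip(mu0, mu1)])
    return float(p_aware), float(p_agnostic)

results = {}
for d in (1.0, 2.0, 4.0):
    for prior_sd in (0.5, 1.0, 2.0):
        results[(d, prior_sd)] = error_probabilities(d, prior_sd)
        p_aware, p_agnostic = results[(d, prior_sd)]
        print(f"d={d:.1f} prior_sd={prior_sd:.1f} "
              f"aware={p_aware:.4f} agnostic={p_agnostic:.4f}")
```

Because the per-identity test is the minimum-error rule for each identity, its error can never exceed that of any shared threshold, so the identity-aware column is never larger than the identity-agnostic one in this model.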
4 Fine-Grained Performance Analysis Over Identities
The detection performance for an overall population of unknown composition may not be the most relevant metric for a journalist who targets a specific celebrity or politician. The individualized deepfake detection proposed in this work allows more tailored optimization on an individual basis. The performance of the proposed individualized deepfake detector and two baseline methods for every public figure is shown in Fig. 8. The figure reveals that the performance of the baseline methods is less consistent across identities. For some identities, the performance of the baseline methods is significantly worse than their own average performance. This underscores the greater reliability and consistency of the proposed method in deepfake detection of public figures.