
License: CC BY-NC-SA 4.0
arXiv:2312.08034v1 [eess.IV] 13 Dec 2023

Individualized Deepfake Detection Exploiting Traces Due to Double Neural-Network Operations

Mushfiqur Rahman, Runze Liu, Chau-Wai Wong, and Huaiyu Dai
North Carolina State University, Raleigh, NC, USA
{mrahman7, rliu10, chauwai.wong, hdai}@ncsu.edu
Abstract

In today’s digital landscape, journalists urgently require tools to verify the authenticity of facial images and videos depicting specific public figures before incorporating them into news stories. Existing deepfake detectors are not optimized for this detection task when an image is associated with a specific and identifiable individual. This study focuses on the deepfake detection of facial images of individual public figures. We propose to condition the proposed detector on the identity of the identified individual given the advantages revealed by our theory-driven simulations. While most detectors in the literature rely on perceptible or imperceptible artifacts present in deepfake facial images, we demonstrate that the detection performance can be improved by exploiting the idempotency property of neural networks. In our approach, the training process involves double neural-network operations where we pass an authentic image through a deepfake simulating network twice. Experimental results show that the proposed method improves the area under the curve (AUC) from 0.92 to 0.94 and reduces its standard deviation by 17%. For evaluating the detection performance of individual public figures, a facial image dataset with individuals’ names is required, a criterion not met by the current deepfake datasets. To address this, we curated a dataset comprising 32k images featuring 45 public figures, which we intend to release to the public after the paper is published.

1 Introduction

Figure 1: The inference pipeline of the proposed individualized deepfake detector leveraging the near-idempotence property and identity conditioning. The identity conditioning is achieved by combining the identity-aware processing trace and the input identity vector. To leverage the idempotence property, the test image is passed through a reconstruction operator $R$. If the test image exhibits a marginal change in the observed amount of processing traces, the test image is considered “deepfake”; if a significant change is observed, the image is considered “authentic”.

A deepfake refers to a seemingly authentic image or video generated by a deep neural network. When it comes to human faces, a manipulation method may comprise reenactment, replacement, editing, and synthesis [41]. While deepfakes can facilitate numerous appealing and advantageous applications, replacing the face in a staged image or video with the face of a public figure can pose a serious threat to society. Given the continuous influx of deepfake videos on public platforms, journalists need to pay special attention to those of significant public interest, such as those featuring celebrities or politicians [41, 26]. Deepfake generation methods have evolved through autoencoder-based approaches [1], GANs [6], and diffusion models [47]. The latest diffusion-based models such as [47, 49] can surpass GAN-based models in producing photorealistic images. Nevertheless, autoencoder-based models remain a threat in terms of malicious use even today. This is due to the availability of several free, downloadable, and user-friendly applications built on autoencoders, such as FaceSwap [5], Faceswap-GAN [6], DeepFaceLab [4], and df [2]. In this work, we focus on Faceswap-GAN.

Most deepfake detectors were built to detect the whole population of deepfake videos, i.e., deepfake videos of whatever identities are targeted. However, victims of deepfakes are most often public figures and their deepfake videos are more detrimental due to their widespread public exposure. In this work, we propose a deepfake image detection system customized for individual subjects. Our theory-driven simulations suggest that identity conditioning on deepfake detection tends to exhibit advantages in more challenging detection tasks. As our experimental results will show, the existing tools for deepfake face detection that encompass the whole population may work suboptimally for a specific public figure. The proposed detector for specific individuals is especially useful for journalism. For example, before reporting news based on an image of a public figure of unknown authenticity, a journalist can apply the proposed detection tool to determine its authenticity.

Our approach to deepfake detection draws inspiration from a series of studies leveraging the near-idempotence property of an operation. This method has been particularly effective in various image forensics tasks, including double JPEG compression detection, unknown video codec identification, and source camera identification [32, 53, 9, 23, 40]. In these studies, researchers leverage the near-idempotence of a respective operation, such as a certain type of JPEG compression, video compression, or color demosaicing algorithm. The strict idempotence property asserts that an idempotent operation, $f(\cdot)$, results in no change to $f(x)$ when it is applied again, i.e., $f(f(x)) = f(x)$. Relaxing this slightly, if $f(f(x))$ approximately equals $f(x)$, the operation is nearly idempotent. In many detection problems of multimedia forensics, the nearly idempotent nature of a forgery method allows an analyst to apply the forgery operation again and observe the changes to determine whether the input is being forged for the first time, i.e., an input that has already been forged will exhibit minimal changes.
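The idempotence test described above can be illustrated with a minimal sketch. It assumes a toy stand-in for the forgery operation, namely uniform quantization, which is strictly idempotent (the limiting case of near-idempotence). The function names `quantize` and `change` and the sample values are hypothetical and do not come from the paper.

```python
def quantize(signal, step=8):
    """Toy stand-in for a forgery operation f: snap each sample to a grid."""
    return [step * round(v / step) for v in signal]


def change(a, b):
    """Euclidean distance between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


fresh = [3.0, 17.0, 42.0, 101.0, 200.0]  # never processed before
once = quantize(fresh)                   # f(x)
twice = quantize(once)                   # f(f(x))

first_pass_change = change(fresh, once)   # large: x was not yet processed
second_pass_change = change(once, twice)  # zero: f(f(x)) == f(x)
```

An analyst who only observes an input and applies the operation once more can therefore compare the size of the change: a large change suggests a fresh input, a negligible change suggests an input that was already processed.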

In this work, we demonstrate that near-idempotence is also applicable to the neural network-based Faceswap-GAN [6]. To explore this, we emulate the potential deepfake operation that an attacker might employ, utilizing publicly available data of a public figure and making assumptions about the neural network architecture. Fig. 1 illustrates the inference pipeline of the proposed detector. We feed a test image into the emulated deepfake generator. The expected change in the image due to this operation is dependent on whether the image has undergone a similar operation before. If the image is a deepfake, the near-idempotence property ensures that the change will be minimal. From the standpoint of the deepfake feature extractor, a deepfake image will exhibit processing traces both before and after the operation, leading to subtle observed changes. Conversely, an authentic image without the deepfake operation lacks any processing traces of the neural network, resulting in a significant observable change. The contributions of this paper are threefold.

  • We propose to use the near-idempotence property of neural networks for deepfake face detection, introducing a distinct direction of improvement compared to the state of the art. The idempotence-driven approach can potentially complement existing methods.

  • We demonstrate that identity conditioning can significantly improve the deepfake detection performance over the state-of-the-art end-to-end CNN classifiers.

  • Our detector can focus on specific individuals. Individualized detectors are better suited for journalism.

2 Related Work

2.1 Generation of Deepfake Faces

Early methods of face-swapping such as Bitouk et al. [10] were limited to using two images of two particular persons with similar poses. The images were first aligned with the help of landmark detection, then cropped and postprocessed, including color correction. Subsequent researchers [19] improved on these methods with a 3-D facial model from the source video. The next advancement emerged after the proposal of a deep-learning-based face-swapping architecture [1] built upon one shared encoder and two individual decoders. Faceswap-GAN [6] is the GAN improvement over [1], where the performance of the shared encoder and individual decoders further improves as a result of the GAN’s internal interplay between the generator and discriminator. However, the architectures of [1, 6] can swap faces only between the two identities involved in training. Researchers have proposed identity-agnostic architectures that decouple the identity extraction from the attribute extraction [8, 45, 42, 43, 37].

2.2 Protection Against Deepfake

Researchers have been exploring different methods to detect deepfakes. In the first category, the artifacts of synthetic videos are exploited for deepfake detection, such as the absence of eye blinking [38], inconsistency in head pose [54], disparities in color components [36], and inconsistency between the inner face and the outer face [25]. In the second category, researchers used either an end-to-end convolutional neural network (CNN) structure [51] or a CNN combined with a recurrent neural network (RNN) [29]. In the third category, researchers exploit processing traces left by the neural networks for deepfake detection. The exploited features include spatial-domain local convolutional features [28], spectral distortion caused by up-convolutions [26], and upsampling artifacts in the frequency domain [27].

Identity-driven Deepfake Detection. Instead of detecting deepfake videos for the whole population, recent work also exploited characteristics of a specific person for deepfake detection. Agarwal et al. [7] targeted deepfake videos of a specific individual by capturing speaking patterns. Cozzolino et al. [21] proposed to learn the temporal features of how a specific person moves and talks. Dong et al. [25] calculated the $\ell^{2}$ distance between the computed identity vector from the inner face and the expected identity vector drawn from a reference set of identity vectors. In this work, we extract the deepfake traces conditioned on the identity.

2.3 Idempotency as a Multimedia Forensics Tool

In multimedia forensics, one way to detect counterfeiting is to exploit the near-idempotence property, i.e., the minor changes caused by the repetitive application of adversarial operations. It shares the same spirit as the law of diminishing returns, a widely used concept in economics [13, 50]. The detection of double JPEG compression, source camera identification, and video codec identification are three exemplary applications of the near-idempotence property. The ratio of stable image blocks has been used by researchers to detect the number of prior JPEG compressions [35, 15]. Huang et al. [32] found that the number of dissimilar JPEG coefficients between two subsequent JPEG compressions decreases monotonically. Bestagini et al. [9] detected unknown video encoding by recompressing a video with each of the candidates. For source camera identification, researchers have leveraged the near-idempotence property of an auto-white balancing method [23] and that of a color demosaicing strategy [40]. In economics, the law of diminishing returns states that additional inputs to a fixed amount of identical inputs increase productivity at a decreasing rate [13]. If the additional inputs are considered repetitive operations, then the law of diminishing returns may be viewed as a form of near-idempotence. In this study, we show that the near-idempotence property of neural networks assists in deepfake image detection.

2.4 Unsupervised Pretraining

Unsupervised pretraining has been proposed for feature extraction in many computer vision tasks. Chen et al. [18] found that larger networks, for example, a larger ResNet, pretrained in an unsupervised manner and followed by supervised training with only 10% of labeled data can outperform fully supervised networks for general computer vision tasks. Newell and Deng [44] showed that pretrained networks are more advantageous in low-data regimes than when data is abundant. Their results suggest that pretrained networks should be tested on diverse downstream tasks. Bulat et al. [14] proposed task-agnostic self-supervised pretraining on in-the-wild facial data for representation learning. Zheng et al. [58] proposed weakly supervised facial representation learning using the vast number of facial images available on the web with linguistic descriptions. In this work, we utilize the facial features from Bulat et al. [14] to additionally learn the deepfake traces.

3 Threat Model

In this work, we consider an attacker who is capable of finding and using open-source face-swapping software such as [6, 5, 1] on the facial images from the publicly available videos of a public figure. More specifically, we consider Faceswap-GAN [6] as a potential method that the attacker can use. The attacker is free to use any public or private videos of a second person to depict a story, with the aim of convincing the public of the involvement of a targeted public figure. For example, the attacker can record prearranged videos at a professional studio and later replace the actor’s face with that of the public figure. The attacker can harvest videos of the public figure from multiple sources, including social media, news channels, movies, and YouTube. Different sources of videos offer varied image quality, compression levels, and processing histories. For example, public interview videos of a public figure available on YouTube are expected to be less edited than video clips from movies. In our proposed detection method, we assume that we, as forensic analysts, have access to the various sources of public figure videos, but we do not know exactly from which source the attacker took videos for deepfake generation. For example, the attacker can use videos from social media, whereas we only use public interview recordings of that public figure to train the neural-network-based detector.

4 Proposed Detector via Near-Idempotence and Identity Conditioning

In the challenge of identifying deepfake faces for public figures, we confront an image of unknown authenticity, claimed to be a specific public figure. Our approach to addressing this problem makes use of the extensive collection of authentic images or videos of the said public figure from YouTube. The training process of our proposed deepfake detector is depicted in Fig. 2 and the inference pipeline is shown in Fig. 1.

Figure 2: The training pipeline of the proposed deepfake detector leveraging the near-idempotence property of the deepfake generator. A side-by-side comparison with conventional deepfake detectors is also shown. In the proposed method, an authentic image is passed through a deepfake simulating network or reconstruction operator, twice. Due to the near-idempotence property, the features for the first and the second outputs will be nearly identical. The features are obtained from an identity-aware feature extractor that is trained separately. We freeze the feature extractor network and train a Siamese network and an identity decoder to increase the Euclidean distance between the first pair (consisting of the authentic image and the first output image) and to decrease the Euclidean distance between the second pair (consisting of the first and the second output images).

Our proposed detector has four distinct components. First, the reconstruction operator is a neural network operation that simulates the deepfake generation operation for a public figure. We found this operation to be nearly idempotent. Second, the feature extractor is finetuned with a teacher network and is able to capture the identity information while extracting the features. Third, the identity decoder takes as input the explicit identity, i.e., the index of the public figure, and learns a constant identity vector that augments the feature space. This vector contains the necessary person-specific information of that public figure and, when combined with the identity-aware feature, can effectively yield the deepfake features conditioned on identity. Fourth, the Siamese network serves as the final binary classification block in the proposed architecture. It learns to extract the features linked to the idempotency of the deepfake operation. It produces a larger distance before and after reconstruction for an authentic test image and a smaller distance for a deepfake test image.

4.1 Reconstruction Operator and Idempotence-Driven Detection

We employ a dedicated reconstruction operator $R$ for each public figure as shown in Fig. 1 and Fig. 2. When the original image is authentic, the first operation generates a deepfake image, and the second operation produces a doubly processed deepfake. We verified experimentally that the reconstruction operator $R$ serves as a reliable approximation of a specific type of deepfake generation tool, such as Faceswap-GAN [6], and that the deepfake generation process is nearly idempotent. In this context, the distance between a deepfake image and its corresponding doubly processed deepfake tends to be close to zero. This characteristic is leveraged in the training and inference system.

The next consideration is how to obtain the identity-specific reconstruction operator. For each public figure within our scope, we accumulate numerous images of that public figure and train a neural network based on an autoencoder utilizing the encoder and decoder architecture from FaceSwap-GAN [6]. This network learns the facial characteristics of the public figure, and when given a facial image of that public figure, it can reproduce approximately the same image as the output. Since the objective of this network is to replicate the input facial image of an identity, we refer to the resulting operator as the reconstruction operator or emulated deepfake generator. Some examples of reconstructed images are shown in Fig. 3.

Figure 3: (a) Facial regions from raw images (first row) and reconstructed images (second row). The reconstructed images are singly processed. (b) Facial regions from deepfake images (first row) and reconstructed images (second row). The reconstructed images are doubly processed. The reconstruction models trained with images from the same person result in good visual quality for both raw and deepfake images.

The reconstruction operator $R$ exhibits near-zero changes to a deepfake image due to the near-idempotence. Consequently, the feature-level Euclidean distance between the two is expected to be small. On the other hand, an authentic image and its corresponding processed image will be substantially different as the operation leaves discernible traces in the processed image. Considering the capability of our deepfake feature extractor (see Sec. 4.2) to detect these traces, the features will exhibit significant dissimilarity, resulting in a higher distance compared to the deepfake scenario.

Based on the above considerations, the initial problem of detecting whether an image is authentic or deepfake is now reframed as evaluating the change of the image in the feature space through the reconstruction operation. When this change, quantified as the Euclidean distance, approaches zero, the image is classified as a deepfake; otherwise, it is considered authentic. Denoting the input image by $\mathrm{f}$, the reframed problem is to evaluate whether $\mathrm{f}$ and $R(\mathrm{f})$ are the same or not, where $R$ is our reconstruction operator. Treating $\mathrm{f}$ and $R(\mathrm{f})$ as two inputs, we note that the Siamese network [12] is a powerful approach for discerning similarity or dissimilarity between two inputs. Our use of the Siamese network will be discussed in Sec. 4.4.
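The reframed decision rule can be sketched as follows. This is only an illustration under stated assumptions: `toy_reconstruct` and `toy_features` are hypothetical stand-ins for the reconstruction operator $R$ and the feature extractor, and the fixed `threshold` is an assumed parameter, since the paper computes the distance with a trained Siamese network (Sec. 4.4) rather than on raw values.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(image, reconstruct, extract_features, threshold):
    """Label an image by how much the reconstruction changes its features."""
    d = euclidean(extract_features(image), extract_features(reconstruct(image)))
    label = "deepfake" if d < threshold else "authentic"
    return label, d


def toy_reconstruct(image):
    """Hypothetical stand-in for R: snap every value to a grid of step 8."""
    return [8 * round(p / 8) for p in image]


def toy_features(image):
    """Hypothetical stand-in for the feature extractor: raw values."""
    return list(image)


authentic_like = [3, 17, 42, 101]              # never passed through R
deepfake_like = toy_reconstruct([5, 26, 37, 99])  # already passed through R

label_a, dist_a = classify(authentic_like, toy_reconstruct, toy_features, 1.0)
label_d, dist_d = classify(deepfake_like, toy_reconstruct, toy_features, 1.0)
```

In this toy setting the already-processed input yields a zero distance and is labeled "deepfake", while the unprocessed input yields a clearly nonzero distance and is labeled "authentic".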

4.2 Identity-Aware Feature Extractor

Motivation. A conventional deepfake feature extraction network $B(\cdot)$ extracts the deepfake features $B(\mathrm{f})$ for a test image $\mathrm{f}$ while ignoring the person identity $\mathrm{I}$ [11, 22, 57], or considers the identity features irrelevant to forgery detection [30, 52]. Our work found that the identity-aware feature, $B'(\mathrm{f})$, which extracts identity information in addition to the deepfake features, is more effective for deepfake detection. A possible explanation is that a given extracted feature may not be equally discriminative for every identity. If a feature extractor does not let the identity information pass through, the later network cannot learn the statistics of the features individually for each identity and is limited to learning the average pattern. Such average distributions of the features will lead to the error probability of the Bayesian classifier as follows:

$$P_{\text{e}}^{\text{com}} = \mathbb{P}(H_0)\,\mathbb{P}(C = 1 \mid H_0) + \mathbb{P}(H_1)\,\mathbb{P}(C = 0 \mid H_1), \tag{1}$$

where $\mathbb{P}(\cdot)$ is the probability measure, $H_0$ and $H_1$ are the two hypotheses, and $C$ is the predicted class. On the other hand, if the feature extractor allows passing the identity, the later network can distinguish the features for each identity separately. Knowing the distributions of the features for each identity separately will lead to the error probability:

$$P_{\text{e}}^{\text{ind}} = \frac{1}{N} \sum_{\mathrm{I} \in \mathbb{I}} \Big[ \mathbb{P}(H_0 \mid \mathrm{I})\,\mathbb{P}(C = 1 \mid H_0, \mathrm{I}) + \mathbb{P}(H_1 \mid \mathrm{I})\,\mathbb{P}(C = 0 \mid H_1, \mathrm{I}) \Big], \tag{2}$$

where $\mathbb{I}$ is the set of all identities. In Sec. 3 of the supplementary document, we show that the latter identity-conditioning approach is more powerful in reducing classification error. We conducted a performance comparison between the two methods through theory-driven simulations, demonstrating that $P_{\text{e}}^{\text{ind}}$ tends to be lower (better) than $P_{\text{e}}^{\text{com}}$. Furthermore, we observed that the gain of $P_{\text{e}}^{\text{ind}}$ over $P_{\text{e}}^{\text{com}}$ is more significant when the deepfake traces for individuals are more unique and the detection problem is intrinsically more difficult.
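The gap between the two error probabilities can be illustrated with a toy calculation. This is not the simulation from the supplementary document. It assumes two hypothetical identities with a scalar, unit-variance Gaussian feature, equal priors, $N$ equal to the number of identities, and a decision $C = 1$ whenever the feature exceeds a threshold. The "common" detector is further simplified to a single shared threshold, which is not the full Bayesian classifier on the pooled distributions.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical per-identity feature means under H0 and H1 (unit variance).
IDENTITIES = {"A": (0.0, 2.0), "B": (3.0, 5.0)}


def identity_error(mu0, mu1, threshold, prior_h0=0.5):
    """P(H0|I) P(C=1|H0,I) + P(H1|I) P(C=0|H1,I) for one identity."""
    false_alarm = 1.0 - phi(threshold - mu0)  # feature above threshold under H0
    miss = phi(threshold - mu1)               # feature below threshold under H1
    return prior_h0 * false_alarm + (1.0 - prior_h0) * miss


def average_error(thresholds):
    """Average the per-identity error over all identities."""
    total = 0.0
    for name, (mu0, mu1) in IDENTITIES.items():
        total += identity_error(mu0, mu1, thresholds[name])
    return total / len(IDENTITIES)


# Individualized: midpoint threshold per identity (optimal for equal priors
# and equal variances).
individual = {name: (mu0 + mu1) / 2.0 for name, (mu0, mu1) in IDENTITIES.items()}
p_ind = average_error(individual)

# Common: the best single threshold shared by all identities, via grid search.
grid = [k / 100.0 for k in range(-200, 801)]
p_com = min(average_error({name: t for name in IDENTITIES}) for t in grid)
```

Because the two identities place their features in different ranges, no shared threshold separates both well, so `p_com` stays above `p_ind` in this toy setting.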

Training. To make the feature extraction network identity-aware, we use a neural network in which the earlier layers extract identity-aware features along with other features, and the later layers extract deepfake traces. We use a learned facial representation trained by Bulat et al. [14] as the starting point for training $B'(\cdot)$. Their trained network has a ResNet architecture. For extracting deepfake features, we tune the portion of the network after the “conv4” block.

We reused the model and initial weights from Bulat et al. [14] for the following three reasons. First, having an existing network that lets personal identity pass through makes our task easier to additionally learn the deepfake traces. In comparison, training a network simultaneously for personal identity and deepfake detection would require joint training of two downstream tasks, which is harder. Second, a deeper network trained with unlabelled data is less biased to any specific portion of the dataset [18]. Bulat et al. [14] pretrained the ResNet architecture with ~10 million facial images. Consequently, the initial layers of the network are anticipated to learn a robust representation of features, including the identity. The network is also tested over multiple downstream tasks and therefore, it is a good candidate for extracting facial features [44]. Third, according to Newell and Deng [44], there is an advantage in unsupervised pretraining with unlabeled data when the labeled finetuning dataset is small, which aligns with our labeled training dataset comprising 295 videos from 59 celebrities.

Figure 4: Backbone network training for identity-aware deepfake feature extraction. An authentic and deepfake image pair is passed through the teacher and student networks. Teacher network passes down the deepfake trace knowledge to the student network through loss functions $L_1$ and $L_2$. The loss function $L_3$ increases the feature distance between the authentic and the deepfake image.

The training for the backbone network $B'(\cdot)$ is depicted in Fig. 4. The input is an image pair, consisting of an authentic image and its corresponding deepfake, generated using a deepfake generation tool. The input is passed through a student network $B_{\rm s}$ and a teacher network $B_{\rm t}$ in parallel. The student network is composed of the pretrained facial representation learning backbone [14] and a concatenated task adaptation head for learning specifically the deepfake traces. The layers after the “conv4” block of the pretrained backbone and the task adaptation head are the tunable portions of the student network. We then utilize the EfficientNetAutoAttB4ST [11] as the teacher network to distill the knowledge for learning the deepfake traces. To adapt the deepfake traces based on personal identity, we add a loss function $L_3$ that contrasts the learned traces of a deepfake and its corresponding authentic image in addition to the knowledge distillation losses $L_1$ and $L_2$. Given the authentic facial image of identity $\mathrm{I}$, $\mathrm{f}_{\rm auth}$, and its corresponding deepfake image $\mathrm{f}_{\rm df}$, the loss terms are defined as follows:

$$L_1 = D^2\Big(B_{\rm t}(\mathrm{f}_{\rm auth}),\, B_{\rm s}(\mathrm{f}_{\rm auth})\Big), \tag{3a}$$
$$L_2 = D^2\Big(B_{\rm t}(\mathrm{f}_{\rm df}),\, B_{\rm s}(\mathrm{f}_{\rm df})\Big), \tag{3b}$$
$$L_3 = \Big[\max\Big(0,\, m - D\big(B_{\rm s}(\mathrm{f}_{\rm auth}),\, B_{\rm s}(\mathrm{f}_{\rm df})\big)\Big)\Big]^2, \tag{3c}$$

where $D(\cdot)$ is the Euclidean distance and $m$ is the margin of the hinge loss. The three loss terms are combined as $\alpha(L_1 + L_2) + \beta L_3$, with hyperparameters $\alpha$ and $\beta$. $L_1$ and $L_2$ contribute to the knowledge distillation for learning the deepfake traces and $L_3$ contributes to learning the deepfake traces according to identity.
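The combined backbone loss can be written out as a small sketch operating on precomputed feature vectors. The function name `backbone_loss`, the default values of `margin`, `alpha`, and `beta`, and the example vectors are assumptions for illustration; the sketch also assumes that teacher and student features have the same dimension.

```python
import math


def dist(u, v):
    """Euclidean distance D between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def backbone_loss(t_auth, s_auth, t_df, s_df, margin=1.0, alpha=1.0, beta=1.0):
    """Return alpha * (L1 + L2) + beta * L3 and the individual terms.

    t_* are teacher features B_t(.), s_* are student features B_s(.).
    """
    l1 = dist(t_auth, s_auth) ** 2                    # distillation, authentic
    l2 = dist(t_df, s_df) ** 2                        # distillation, deepfake
    l3 = max(0.0, margin - dist(s_auth, s_df)) ** 2   # hinge on student pair
    return alpha * (l1 + l2) + beta * l3, (l1, l2, l3)


total, (l1, l2, l3) = backbone_loss(
    t_auth=[0.3, 0.4], s_auth=[0.0, 0.0],
    t_df=[0.6, 0.0], s_df=[0.6, 0.0],
)
```

In this example the student matches the teacher on the deepfake image, so only the authentic distillation term and the hinge term contribute; once the student features of the pair are at least `margin` apart, the hinge term vanishes.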

4.3 Identity Decoder for Feature Conditioning

Our identity decoder is a single-layer fully-connected neural network that maps the one-hot-encoded index of a public figure to the feature space generated by our feature extractor. We combine the output of the identity decoder with the output of the identity-aware feature extractor that contains the joint information of the deepfake feature and identity. The extra marginal information provided by the identity decoder can have the effect of conditioning the identity-aware feature, in a similar spirit as in the Bayes rule.
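A single fully connected layer applied to a one-hot input simply selects one column of its weight matrix (plus the bias), which is why the decoder amounts to a learned constant vector per identity. The sketch below illustrates this. The weight values are made up, and combining by element-wise addition in `condition` is an assumption, since the combination operation is not specified in the text above.

```python
def one_hot(index, n):
    """One-hot encoding of an identity index among n identities."""
    return [1.0 if k == index else 0.0 for k in range(n)]


def identity_decoder(weights, bias, index):
    """Single fully connected layer applied to a one-hot identity index.

    weights has shape (feature_dim, n_identities); the output equals
    column `index` of weights plus the bias.
    """
    x = one_hot(index, len(weights[0]))
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def condition(feature, identity_vector):
    """Assumed combination: element-wise addition of the two vectors."""
    return [f + v for f, v in zip(feature, identity_vector)]


# Hypothetical parameters: 2-dimensional features, 3 identities.
weights = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]
bias = [0.0, 0.0]

identity_vector = identity_decoder(weights, bias, index=1)
conditioned = condition([1.0, 1.0], identity_vector)
```

Here `identity_vector` is the second column of `weights`, and the same vector is returned for every image of that identity regardless of the image content.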

4.4 Contrastive Learning

The Siamese network contains two identical subnetworks that process the two inputs in parallel. Trained with a contrastive loss, the subnetworks learn a manifold for each of the inputs, which allows strong discrimination between the two inputs. In our work, we designed each subnetwork as a single-layer neural network that takes the features of the corresponding image as input and outputs a vector of length $50$. We experimentally verified that this length is sufficient for discriminating the two cases. Let us call the two subnetworks of the Siamese network $S_{n_1}$ and $S_{n_2}$, where the first processes the features of f and the second processes the features of $R(\text{f})$. We used the contrastive loss [31] to train the Siamese network as follows:

$$L\big(\text{f}, R(\text{f}), Y\big) = (1 - Y)\, D^{2}_{S_n} + Y \left[\max\left(0,\; m - D_{S_n}\right)\right]^{2},$$ (4)

where $D_{S_n}$ is the Euclidean distance between the processed manifolds, i.e., $D_{S_n} = \left\| S_{n_1}(X_1) - S_{n_2}(X_2) \right\|_2$, $X_1$ denotes the identity-conditioned features of f, $X_2$ denotes the identity-conditioned features of $R(\text{f})$, $m > 0$ is a margin, and $Y \in \{0, 1\}$ is the known binary label of f, i.e., $Y$ is $1$ if f is authentic and $0$ otherwise. We learned the weights of the identity decoder and the two subnetworks of the Siamese network using this loss function. Additionally, in contrast to the standard Siamese network, we decoupled the weights of the two subnetworks, $S_{n_1}$ and $S_{n_2}$, similar to CLIP [46], which improved performance.
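Eq. (4) can be written out directly. The sketch below assumes the two subnetwork outputs $S_{n_1}(X_1)$ and $S_{n_2}(X_2)$ are already available as plain lists; the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(z1, z2, y, margin):
    """Contrastive loss of Eq. (4).

    z1, z2: outputs of the two subnetworks for f and R(f).
    y:      1 if f is authentic, 0 otherwise.
    """
    d = euclidean(z1, z2)
    # y == 0 (deepfake): penalize any distance, pulling f and R(f) together.
    # y == 1 (authentic): penalize distances below the margin, pushing apart.
    return (1 - y) * d ** 2 + y * max(0.0, margin - d) ** 2
```

With this labeling, an input that already carries neural-network traces is encouraged to stay close to its reconstruction, whereas an authentic input is encouraged to move at least $m$ away from it.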

5 Experimental Results

5.1 Dataset Curation

Deepfake detection methods generally perform well in in-dataset evaluation, but their performance drops significantly in cross-dataset evaluation [17]. To provide reliable measurements of the performance of the deepfake detection method, we report only cross-dataset evaluation results. As our deepfake detector conditions on the identity, we need two independent datasets containing facial images of the same set of identities. Most public deepfake detection datasets, such as DFDC [24], DFD [3], and DeeperForensics [33], do not explicitly provide identity information associated with the videos. It is therefore hard to find the same persons in another dataset, which is necessary for the cross-dataset evaluation of individualized deepfake detection. To the best of our knowledge, our work is the first to evaluate deepfake detection performance on another dataset for each person separately. We curate a dataset from Celeb-DF [39] and CACD [16]. Our curated dataset contains $32$k facial images of $45$ public figures, sourced from YouTube videos for the training set and from the cross-age facial image dataset of the same public figures for the test set. We plan to release the source code and dataset for public use after publication.

For the training dataset, we use real videos from the Celeb-DF dataset [39], which is a popular deepfake detection dataset of 59 public figures. We sample frames from the videos at $5$ frames per second (fps) and detect faces in the sampled frames using the MTCNN [56] face detection network. For each individual $i$, we have facial images $f^{\text{cdf}}_{i,j,k}$ from the $j$th authentic Celeb-DF video, where $j \in \{1, \dots, 10\}$, $k \in \{1, \dots, N_f\}$, and $N_f$ is the number of frames extracted from the video.
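The frame-sampling step can be illustrated as follows. This is only a sketch: it assumes a constant native frame rate, the helper name is hypothetical, and it does not reproduce the authors' actual extraction pipeline or the MTCNN face detection step.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=5.0):
    """Indices of the frames to keep when subsampling a video of
    `num_frames` frames from `native_fps` down to `target_fps`."""
    if target_fps <= 0 or native_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = native_fps / target_fps  # native frames per kept frame
    indices = []
    k = 0
    while True:
        idx = int(round(k * step))
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices
```

For example, a two-second clip recorded at 30 fps (60 frames) yields ten kept frames, one every six native frames.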

After examining multiple candidate datasets, we selected the CACD [16] dataset for cross-dataset evaluation. CACD contains cross-age facial images of 2,000 public figures, 45 of whom also appear in the Celeb-DF dataset. From the CACD dataset, we have authentic images $f^{\text{cacd}}_{i,j,k}$ for the $i$th identity and the $j$th available age group of that identity, where $j \in \{1, \dots, 10\}$, $k \in \{1, \dots, N_i\}$, and $N_i$ is the number of images available for that age group. To generate deepfake faces for the test set, for each person $i$, we choose another identity $m$ from the database of 2,000 persons and then train a Faceswap-GAN [6] model using the facial images $f^{\text{cacd}}_{i,j,k}$ and $f^{\text{cacd}}_{m,j,k}$.

5.2 Experimental Setup

As our proposed deepfake detector is designed to operate only when the identity is provided, we conduct our evaluation on our curated dataset comprising two distinct subsets of facial images with an overlap of $45$ public figures. We use the cropped facial images from the video frames of Celeb-DF [39] as the training set and the images from CACD [16] as the test set.

Our proposed method had two stages of training. In the first stage, we trained the identity-aware feature extractor. For this training, we resized the facial images to 224-by-224 and used random cropping and random horizontal flipping for image augmentation. As shown in Fig. 4, we used a pair of images for the backbone training. We enforced identical cropping within the same pair consisting of an authentic image and its deepfake. We used the SGD optimizer with a learning rate of $10^{-3}$. The contrastive loss margin was $50$, and the values of $\alpha$ and $\beta$ were varied manually within $[\alpha, \beta] \in (0,1)^{2}$. We set $[\alpha, \beta]$ to $[1, 0]$ during the initial 1,500 epochs of training and subsequently changed it to $[0, 1]$ for the next 1,500 epochs. In the second stage, we trained the Siamese network and the identity decoder. For this training, we used the Adam optimizer with a contrastive loss margin of $2$, and the learning rate was determined by a grid search within the range $(10^{-6}, 10^{-5})$.
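The first-stage switch of the loss weights can be expressed as a simple schedule. The sketch below only encodes the two phases reported above (1,500 epochs each); the function name and the zero-based epoch convention are assumptions.

```python
def loss_weights(epoch, switch_epoch=1500, total_epochs=3000):
    """Return (alpha, beta) for a zero-based epoch index.

    Phase 1 (epochs 0..1499):    only the distillation terms L1 + L2.
    Phase 2 (epochs 1500..2999): only the identity-related hinge term L3.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch outside the training schedule")
    return (1.0, 0.0) if epoch < switch_epoch else (0.0, 1.0)
```

In a training loop, the returned pair would weight the per-batch losses as `alpha * (l1 + l2) + beta * l3`.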

For face reconstructor training, we separated the facial images from the last five videos, $f^{\text{cdf}}_{i,j,k}$ with $j \in \{6, \dots, 10\}$, of the Celeb-DF dataset. For the final classification network training, we randomly selected the facial images from one video, $f^{\text{cdf}}_{i,j,k}$ with $j \in \{1, \dots, 5\}$, as the validation set and used the facial images from the other four videos as the training set. We repeated this process four times to ensure that the results would be statistically stable. As for the test set, we used all of the real and face-swapped images that we generated from CACD. In each training session, the neural network with the smallest validation loss was chosen as the final network for the test set.
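The repeated train/validation split over videos 1 to 5 can be sketched as below. Whether the held-out video may repeat across the four trials is not stated, so independent random draws are an assumption, as are the function name and the seed.

```python
import random


def make_splits(video_ids=(1, 2, 3, 4, 5), repeats=4, seed=0):
    """Draw `repeats` splits, each holding out one video for validation
    and keeping the remaining videos for training."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        val_video = rng.choice(video_ids)  # assumed: drawn independently
        train_videos = [v for v in video_ids if v != val_video]
        splits.append((train_videos, val_video))
    return splits
```

Splitting by video rather than by frame keeps near-duplicate frames of the same scene from appearing in both the training and validation sets.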

5.3 Performance Gain

In this subsection, we compare the performance of our proposed method with two state-of-the-art methods. The first baseline is the Xception [20] network trained on the FaceForensics++ dataset [48], which contains deepfake videos generated by four methods including Faceswap [5]. The second is the EfficientNetAutoAttB4ST [11] network trained on the DFDC dataset [24], a dataset consisting of deepfake videos generated by various popular face-swapping methods, such as Faceswap-GAN [6], StyleGAN [34], Faceswap [5], and NTH [55]. The performance of the two baseline approaches on our test dataset is presented in Tab. 1.

Table 1: Comparison of the deepfake detection performance of the proposed method, which leverages near-idempotence and identity conditioning, with the state of the art.
Method                      AUC Mean (SD)   AUC Median (IQR)   AUC Trimmed Mean (10%)
Xception [20]               0.792 (0.11)    0.799 (0.14)       0.799
Xception [20] (tuned)       0.887 (0.07)    0.896 (0.09)       0.894
EfficientNet [11]           0.728 (0.13)    0.733 (0.16)       0.732
EfficientNet [11] (tuned)   0.920 (0.06)    0.926 (0.07)       0.927
Proposed                    0.940 (0.05)    0.958 (0.05)       0.947

To ensure a fair comparison with our proposed method, we fine-tuned these two baseline methods using our training dataset. This involved keeping the features frozen and training a classification layer on top of the features until the performance saturated on the validation dataset. After fine-tuning, EfficientNetAutoAttB4ST [11] had an AUC mean of $0.92$ across identities with a sample standard deviation of $0.06$.

To evaluate the idea of utilizing idempotency and identity conditioning, we applied the double neural-network operation and obtained the features from our trained identity-aware feature extractor. We concatenated those with the features of EfficientNetAutoAttB4ST [11]. Tab. 1 shows that the proposed method achieves an AUC mean of $0.94$ across identities, an increase of $0.02$ over Bonettini et al. [11]. The AUC median across identities was $0.958$, a gain of $0.032$ over the baseline [11]. The 10%-trimmed mean was $0.947$, a gain of $0.02$. The AUC standard deviation was reduced by $0.01$, or $17\%$, and the AUC interquartile range was reduced by $0.02$, or $29\%$, compared to the baseline [11]. This result demonstrates that idempotency and identity conditioning can improve both the level of detection performance and its consistency across identities. The detection results on the test dataset for six of the $45$ public figures are shown in Fig. 5.
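The summary statistics and the relative reductions quoted above can be reproduced with a short script. This is a sketch under assumptions: the per-identity AUC values are not available here, the 10% trimming is taken to mean 10% removed from each tail, and the quartile convention is the default of Python's `statistics.quantiles`, which may differ from the one used for Tab. 1.

```python
import statistics


def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the samples from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.fmean(ordered[k:len(ordered) - k])


def summarize(aucs):
    """Mean, sample SD, median, IQR, and trimmed mean of per-identity AUCs."""
    q1, _, q3 = statistics.quantiles(aucs, n=4)
    return {
        "mean": statistics.fmean(aucs),
        "sd": statistics.stdev(aucs),
        "median": statistics.median(aucs),
        "iqr": q3 - q1,
        "trimmed_mean": trimmed_mean(aucs),
    }


def relative_reduction(baseline, proposed):
    """Fractional reduction of a dispersion measure relative to the baseline."""
    return (baseline - proposed) / baseline


# Values from Tab. 1: SD 0.06 -> 0.05 and IQR 0.07 -> 0.05.
sd_reduction = relative_reduction(0.06, 0.05)   # about 0.167, i.e., 17%
iqr_reduction = relative_reduction(0.07, 0.05)  # about 0.286, i.e., 29%
```

The median and the trimmed mean are less sensitive than the mean to a few identities with unusually low AUC, which is why all three are reported side by side.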

Figure 5: ROC curves for deepfake detection using the proposed method. Each plot contains results from a public figure and each curve represents a trial of training the network. AUC values are large with small standard deviations, indicating good performance.

The average AUC across all public figures is $0.940$ and the sample standard deviation is $0.05$. We also performed $t$-tests, which showed that the AUC of the proposed method is significantly better than that of the off-the-shelf detectors. The larger variance of the AUC values of the baseline methods implies that a baseline detector may perform convincingly for one identity while carrying a greater risk of unacceptable performance for others. This makes the baseline methods less attractive for journalists.

5.4 Ablation Studies

Tab. 2 displays the results of the ablation studies. In the first ablation study, we applied our idempotent strategy (with the identity decoder) using the EfficientNetAutoAttB4ST features. In the second study, we concatenated the features from the identity-aware feature extractor with the features of EfficientNetAutoAttB4ST, as we did in our proposed method, and used a feedforward network to classify the images. The first ablation achieved an AUC mean of $0.926$ and an AUC median of $0.928$, with a sample standard deviation of $0.05$ and an interquartile range of $0.06$. The second ablation achieved an AUC mean of $0.893$ and an AUC median of $0.920$, with a sample standard deviation of $0.10$ and an interquartile range of $0.13$. Both ablations yield lower AUC values than the proposed method (gaps of $0.014$ and $0.047$ in the AUC mean, respectively). This indicates that identity conditioning and the idempotence strategy have synergy (positive interaction).

Table 2: Ablation studies for the proposed method.
Method                    AUC Mean (SD)   AUC Median (IQR)   AUC Trimmed Mean (10%)
Proposed                  0.940 (0.05)    0.958 (0.05)       0.947
Idempotence               0.926 (0.05)    0.928 (0.06)       0.932
Identity-aware features   0.893 (0.10)    0.920 (0.13)       0.904

6 Discussion

In the current work, the reconstruction model is the deepfake generation tool itself, i.e., Faceswap-GAN, one of the most popular and effective off-the-shelf tools for deepfake generation. When the reconstruction model does not match the tool used for deepfake generation, the processing traces caused by deepfake generation and by reconstruction may differ, although both are generated by neural networks. For example, the reconstruction model may be a Faceswap-GAN model whereas the deepfake video is generated by a Faceswap model. A more sophisticated classifier may be needed to exploit the processing traces left by different network structures. Under the proposed double-operations framework, a more general Siamese neural network for processing-trace manifold learning may work, but we leave this exploration to future work.

Compared to end-to-end CNN-based classifiers, our proposed method targets deepfake detection for individuals, with public figures as the main application. Although our method requires training the reconstruction models, the training can be done in advance for each public figure. For example, a journalist can train the reconstruction models for various candidates before they need to verify videos for reporting tasks. Journalists may also share or collaboratively train detectors within their professional networks. To let the detection system support a new individual, the journalist will need to train a reconstruction operator for that individual and then fine-tune the Siamese network.

7 Conclusion and Future Work

In this work, we have proposed to use double neural-network operations and individual conditioning for deepfake detection. The proposed detector can achieve better detection performance than end-to-end CNN-based detectors on our curated dataset of public figures with identity labels. We have found that utilizing identity information can make the deepfake detector more reliable. In future work, we plan to extend the double-operations detection to scenarios with mismatched neural network architectures.

References

  • [1] Deepfakes. https://github.com/deepfakes/faceswap Accessed on: June, 2023.
  • [2] Deepfake. https://github.com/dfaker/df Accessed on: June, 2023.
  • [3] Deep fake detection dataset. https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html Accessed on: June, 2023.
  • [4] Deepfacelab. https://github.com/iperov/DeepFaceLab/ Accessed on: June, 2023.
  • [5] FaceSwap. https://faceswap.dev Accessed on: June, 2023.
  • [6] FaceSwap-GAN. https://github.com/shaoanlu/faceswap-GAN Accessed on: June, 2023.
  • Agarwal et al. [2019] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting world leaders against deep fakes. In IEEE/CVF Conf. Comput. Vision Pattern Recog. Workshops, Long Beach, CA, 2019.
  • Bao et al. [2018] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. Towards open-set identity preserving face synthesis. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 6713–6722, 2018.
  • Bestagini et al. [2012] Paolo Bestagini, Ahmed Allam, Simone Milani, Marco Tagliasacchi, and Stefano Tubaro. Video codec identification. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 2257–2260, 2012.
  • Bitouk et al. [2008] Dmitri Bitouk, Neeraj Kumar, Samreen Dhillon, Peter Belhumeur, and Shree K Nayar. Face swapping: Automatically replacing faces in photographs. In ACM SIGGRAPH, pages 1–8. 2008.
  • Bonettini et al. [2020] Nicolo Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, and Stefano Tubaro. Video face manipulation detection through ensemble of CNNs. In IEEE Int. Conf. Learn. Pattern, 2020.
  • Bromley et al. [1993] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. Adv. Neural Infor. Process. Syst., 6, 1993.
  • Brue [1993] Stanley L Brue. Retrospectives: The law of diminishing returns. Journal of Economic Perspectives, 7(3):185–192, 1993.
  • Bulat et al. [2022] Adrian Bulat, Shiyang Cheng, Jing Yang, Andrew Garbett, Enrique Sanchez, and Georgios Tzimiropoulos. Pre-training strategies and datasets for facial representation learning. In European Conference on Computer Vision, pages 107–125. Springer, 2022.
  • Carnein et al. [2015] Matthias Carnein, Pascal Schöttle, and Rainer Böhme. Forensics of high-quality JPEG images with color subsampling. In IEEE Int. Workshop Informat. Forensics Security, pages 1–6, 2015.
  • Chen et al. [2014] Bor-Chun Chen, Chu-Song Chen, and Winston H. Hsu. Cross-age reference coding for age-invariant face recognition and retrieval. In European Conf. Comput. Vision, 2014.
  • Chen et al. [2022] Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, and Jue Wang. Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 18710–18719, 2022.
  • Chen et al. [2020] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Adv. Neural Infor. Process. Syst., 33:22243–22255, 2020.
  • Cheng et al. [2009] Yi-Ting Cheng, Virginia Tzeng, Yu Liang, Chuan-Chang Wang, Bing-Yu Chen, Yung-Yu Chuang, and Ming Ouhyoung. 3d-model-based face replacement in video. In SIGGRAPH. 2009.
  • Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 1251–1258, 2017.
  • Cozzolino et al. [2021] Davide Cozzolino, Andreas Rössler, Justus Thies, Matthias Nießner, and Luisa Verdoliva. Id-reveal: Identity-aware deepfake video detection. In IEEE/CVF Int. Conf. Comput. Vision, pages 15108–15117, 2021.
  • Dang et al. [2020] Hao Dang, Feng Liu, Joel Stehouwer, Xiaoming Liu, and Anil K Jain. On the detection of digital face manipulation. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 5781–5790, 2020.
  • Deng et al. [2011] Zhonghai Deng, Arjan Gijsenij, and Jingyuan Zhang. Source camera identification using auto-white balance approximation. In IEEE/CVF Int. Conf. Comput. Vision, pages 57–64, 2011.
  • Dolhansky et al. [2020] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397, 2020.
  • Dong et al. [2022] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Ting Zhang, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, and Baining Guo. Protecting celebrities from deepfake with identity consistency transformer. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 9468–9478, 2022.
  • Durall et al. [2020] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: CNN based generative deep neural networks are failing to reproduce spectral distributions. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 7890–7899, 2020.
  • Frank et al. [2020] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In Int. Conf. Mach. Learn., 2020.
  • Guarnera et al. [2020] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Deepfake detection by analyzing convolutional traces. In IEEE/CVF Conf. Comput. Vision Pattern Recog. Workshops, pages 666–667, 2020.
  • Güera and Delp [2018] David Güera and Edward J Delp. Deepfake video detection using recurrent neural networks. In IEEE Int. Conf. Advanced Video Signal Based Surveillance, Auckland, New Zealand, 2018.
  • Guo et al. [2023] Ying Guo, Cheng Zhen, and Pengfei Yan. Controllable guide-space for generalizable face forgery detection. In IEEE/CVF Int. Conf. Comput. Vision, pages 20818–20827, 2023.
  • Hadsell et al. [2006] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 1735–1742, New York, NY, 2006.
  • Huang et al. [2010] Fangjun Huang, Jiwu Huang, and Yun Qing Shi. Detecting double JPEG compression with the same quantization matrix. IEEE Trans. Inf. Forensics Security, 5(4):848–856, 2010.
  • Jiang et al. [2020] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. DeeperForensics-1.0: A large-scale dataset for real-world face forgery detection. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 2889–2898, 2020.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 4401–4410, 2019.
  • Lai and Böhme [2013] ShiYue Lai and Rainer Böhme. Block convergence in repeated transform coding: JPEG-100 forensics, carbon dating, and tamper detection. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 3028–3032, 2013.
  • Li et al. [2018a] Haodong Li, Bin Li, Shunquan Tan, and Jiwu Huang. Detection of deep network generated images using disparities in color components. arXiv preprint arXiv:1808.07276, 2018a.
  • Li et al. [2019] Lingzhi Li, Jianmin Bao, Hao Yang, Dong Chen, and Fang Wen. Faceshifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457, 2019.
  • Li et al. [2018b] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In Ictu Oculi: Exposing AI created fake videos by detecting eye blinking. In IEEE Int. Workshop Informat. Forensics Security, Hong Kong, 2018b.
  • Li et al. [2020] Yuezun Li, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-DF: A large-scale challenging dataset for deepfake forensics. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 3207–3216, 2020.
  • Milani et al. [2014] Simone Milani, Paolo Bestagini, Marco Tagliasacchi, and Stefano Tubaro. Demosaicing strategy identification via eigenalgorithms. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 2659–2663, 2014.
  • Mirsky and Lee [2021] Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey. ACM Comput. Surveys, 54(1):1–41, 2021.
  • [42] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. FsNet: An identity-aware generative model for image-based face swapping. In Asian Conf. Comput. Vision, Perth, Australia, Dec. 2–6, 2018.
  • Natsume et al. [2018] Ryota Natsume, Tatsuya Yatagawa, and Shigeo Morishima. RSGAN: Face swapping and editing using face and hair representation in latent spaces. arXiv preprint arXiv:1804.03447, 2018.
  • Newell and Deng [2020] Alejandro Newell and Jia Deng. How useful is self-supervised pretraining for visual tasks? In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 7345–7354, 2020.
  • Nirkin et al. [2019] Yuval Nirkin, Yosi Keller, and Tal Hassner. FSGAN: Subject agnostic face swapping and reenactment. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 7184–7193, 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 10684–10695, 2022.
  • Rossler et al. [2019] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In IEEE/CVF Int. Conf. Comput. Vision, pages 1–11, 2019.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Infor. Process. Syst., 35:36479–36494, 2022.
  • Spillman [1923] WJ Spillman. Application of the law of diminishing returns to some fertilizer and feed data. Journal of Farm Economics, 5(1):36–52, 1923.
  • Wang et al. [2020] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN-generated images are surprisingly easy to spot… for now. In IEEE/CVF Conf. Comput. Vision Pattern Recog., 2020.
  • Yan et al. [2023] Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. UCF: Uncovering common features for generalizable deepfake detection. In IEEE/CVF Int. Conf. Comput. Vision, pages 22412–22423, 2023.
  • Yang et al. [2014] Jianquan Yang, Jin Xie, Guopu Zhu, Sam Kwong, and Yun-Qing Shi. An effective method for detecting double JPEG compression with the same quantization matrix. IEEE Trans. Inf. Forensics Security, 9(11):1933–1942, 2014.
  • Yang et al. [2019] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. In IEEE Int. Conf. Acoust., Speech, Signal Process., pages 8261–8265, Brighton, UK, 2019.
  • Zakharov et al. [2019] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-shot adversarial learning of realistic neural talking head models. In IEEE/CVF Int. Conf. Comput. Vision, pages 9459–9468, 2019.
  • Zhang et al. [2016] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Letters, 23(10):1499–1503, 2016.
  • Zhao et al. [2021] Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Tianyi Wei, Weiming Zhang, and Nenghai Yu. Multi-attentional deepfake detection. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 2185–2194, 2021.
  • Zheng et al. [2022] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In IEEE/CVF Conf. Comput. Vision Pattern Recog., pages 18697–18709, 2022.

Supplementary Material

1 Tool for Deepfake Generation

High-quality face-swapped videos may be generated by tools based on convolutional autoencoder models. An autoencoder consists of two neural networks, an encoder and a decoder. The encoder $f_{\mathrm{enc}}$ maps the original input $x \in \mathbb{R}^{d_1}$ to a lower-dimensional representation $f_{\mathrm{enc}}(x) \in \mathbb{R}^{d_2}$, where $d_1$ and $d_2$ are the dimensions of the input and the embedding space, respectively, and $d_1 > d_2$. The decoder $f_{\mathrm{dec}}$ reconstructs the input $x$ from the lower-dimensional representation, i.e., $x' = f_{\mathrm{dec}}(f_{\mathrm{enc}}(x))$. Denote the loss function by $L: \mathbb{R}^{d_1} \times \mathbb{R}^{d_1} \rightarrow \mathbb{R}$. With training data $\{x_i\}_{i=1}^{N}$, an autoencoder can be trained by minimizing the loss $\frac{1}{N} \sum_{i=1}^{N} L(x_i, x_i')$. The autoencoder-based deepfake generation tool consists of a shared encoder and two decoders, as shown in Fig. 6. Faceswap-GAN [6] is one of the most popular and effective publicly available tools based on autoencoders and can generate high-quality deepfake videos [24]. In this work, we train Faceswap-GAN models to reconstruct videos and generate deepfake videos. The output videos of Faceswap-GAN will contain processing traces left by the neural network.
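The structure described above, a shared encoder with one decoder per person, can be illustrated with toy stand-ins. Everything in this sketch is hypothetical: the averaging encoder, the two hand-written decoders, and the mean-squared-error loss are placeholders chosen for readability and are not the Faceswap-GAN networks or their training loss.

```python
def mse(x, y):
    """Mean squared error, one possible choice for the loss L."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def empirical_loss(encoder, decoder, samples, loss=mse):
    """(1/N) * sum_i L(x_i, f_dec(f_enc(x_i)))."""
    return sum(loss(x, decoder(encoder(x))) for x in samples) / len(samples)


def toy_encoder(x):
    """Shared encoder, R^4 -> R^2: averages adjacent pairs (d1 > d2)."""
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]


def toy_decoder_a(z):
    """Decoder for person A, R^2 -> R^4."""
    return [z[0], z[0], z[1], z[1]]


def toy_decoder_b(z):
    """Decoder for person B: same layout plus a fixed offset standing in
    for B's appearance."""
    return [z[0] + 1.0, z[0] + 1.0, z[1] + 1.0, z[1] + 1.0]


def swap_a_to_b(x_a):
    """Deepfake path: encode a face of A, decode with B's decoder."""
    return toy_decoder_b(toy_encoder(x_a))
```

Reconstruction uses a person's own decoder, whereas the swap routes the shared embedding of person A through the decoder of person B, which corresponds to the dashed arrow in Fig. 6.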

2 Deepfake Generation and Reconstruction Model $R$

For each target video, we generated a deepfake video by feeding a Faceswap-GAN model with the target video and a video of a public figure with a known identity. The video of the public figure was collected from YouTube, and the public figure has the same gender as the person in the target video. For deepfake generation, we swap the face of the public figure onto the face of the person in the target video, since public figures are usually the victims of deepfakes.

For deepfake detection using double operations, a journalist does not know which videos were used to generate the potentially fake video and knows only the identity of the person in the video in question. Therefore, the reconstruction model was trained using videos of the known public figure from scenes other than those used for deepfake generation. The reconstructing unit $R$ is chosen to share the same neural network structure as the deepfake generating unit, i.e., a Faceswap-GAN. For a specific public figure, we trained a Faceswap-GAN using five videos for $40{,}000$ iterations to ensure good reconstruction quality.
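
The intuition behind the double operation can be illustrated with a toy stand-in for $R$. In this sketch, `reconstruct` is a hypothetical exact projection, which is idempotent by construction; a trained Faceswap-GAN is at best approximately idempotent, and the residual used here is an assumed illustration rather than the feature used by the proposed detector.

```python
import math
from typing import List


def reconstruct(x: List[float]) -> List[float]:
    """Hypothetical stand-in for R: zero out the last coordinate (a projection)."""
    return [x[0], x[1], 0.0]


def residual(x: List[float]) -> float:
    """Euclidean distance between an input and its reconstruction R(x)."""
    x_rec = reconstruct(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_rec)))


authentic = [0.5, -1.0, 2.0]             # never processed by R
processed_once = reconstruct(authentic)  # already carries the trace of R

res_authentic = residual(authentic)       # first pass changes the input
res_processed = residual(processed_once)  # second pass leaves it unchanged
print(res_authentic, res_processed)
```

An input that has already passed through the operator is left unchanged by a second pass, whereas an unprocessed input is not; this gap is what makes a second pass informative in the idealized case.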

Figure 6: A schematic of an autoencoder-based deepfake generator. The two encoders share weights and are used to extract features of the faces of both persons A and B. The decoders for A and B are trained to reconstruct the faces of persons A and B, respectively. The dashed arrow represents the deepfake generation step, i.e., feature A is sent to decoder B, and the reconstructed image is the face of person B with the facial expressions of person A.

3 Advantage of Identity-Conditioned Feature Extraction

Let us consider a set of images $S$ containing authentic and deepfake images. Each image is associated with an identity $k \in \{1, \dots, K\}$. $S$ may be decomposed into disjoint sets as follows:

$$ S = \bigcup_{k=1}^{K} S^{(k)} = S_{\text{auth}} \cup S_{\text{df}} = S_0 \cup S_1, \tag{5} $$

where $S^{(k)}$ is the set of all images belonging to individual $k$, $S_{\text{auth}}$ and $S_{\text{df}}$ are the sets of all authentic and deepfake images, respectively, and $S_0$ and $S_1$ are the acceptance region and rejection region partitioned by a decision rule \citeSvan2004detection.

Let us define $g: S \to \mathbb{R}$ as a powerful manifold-learning feature extractor for deepfake trace extraction, so that the extracted 1-D feature $x = g(\text{f})$ exhibits different distributions for real images $\text{f} \in S_{\text{auth}}$ and fake images $\text{f} \in S_{\text{df}}$. To facilitate our theoretical analysis and simulation, we consider the following hypotheses concerning an observation $x$ for individual $k$:

\begin{align}
H_0:\ x = g(\text{f}) &\sim \mathcal{N}\big(\mu_0^{(k)}, \sigma^2\big), \quad \text{f} \in S^{(k)} \cap S_{\text{auth}}, \tag{6a} \\
H_1:\ x = g(\text{f}) &\sim \mathcal{N}\big(\mu_1^{(k)}, \sigma^2\big), \quad \text{f} \in S^{(k)} \cap S_{\text{df}}, \tag{6b}
\end{align}

where $\mu_0^{(k)}$ and $\mu_1^{(k)}$ have Gaussian priors, namely,

\begin{align}
\mu_0^{(k)} &\sim \mathcal{N}\left(u_0, \sigma_\mu^2\right), \tag{7a} \\
\mu_1^{(k)} &\sim \mathcal{N}\left(u_1, \sigma_\mu^2\right), \tag{7b}
\end{align}

where we set $0 = u_0 < u_1 \in \mathbb{R}$ without loss of generality, and $\sigma_\mu^2$ is the variance of the priors. Fig. 7(a) illustrates the probability density functions (PDFs) of $x = g(\text{f})$ under $H_0$ and $H_1$ for five individuals. When identity information is unknown, the PDFs under each hypothesis merge into one, as shown in Fig. 7(b).
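
The two-level model in (6) and (7) can be sampled directly. The sketch below is a minimal illustration using Python's standard library; the function names and the parameter values ($u_1 = 1$, $\sigma = 1$, $\sigma_\mu = 0.5$, 200 samples per class) are assumptions chosen for the example, not settings reported in this work.

```python
import random
from typing import List, Tuple


def sample_identity_means(
    num_identities: int, u0: float, u1: float, sigma_mu: float, rng: random.Random
) -> List[Tuple[float, float]]:
    """Draw (mu0^(k), mu1^(k)) for each identity from the priors in (7a) and (7b)."""
    return [
        (rng.gauss(u0, sigma_mu), rng.gauss(u1, sigma_mu))
        for _ in range(num_identities)
    ]


def sample_feature(
    mu0: float, mu1: float, sigma: float, is_fake: bool, rng: random.Random
) -> float:
    """Draw the 1-D feature x under H1 (fake) or H0 (authentic), as in (6a) and (6b)."""
    return rng.gauss(mu1 if is_fake else mu0, sigma)


rng = random.Random(0)
means = sample_identity_means(num_identities=5, u0=0.0, u1=1.0, sigma_mu=0.5, rng=rng)

# Each record is (identity index, is_fake label, feature value).
samples = [
    (k, is_fake, sample_feature(mu0, mu1, 1.0, is_fake, rng))
    for k, (mu0, mu1) in enumerate(means)
    for is_fake in (False, True)
    for _ in range(200)
]
print(len(samples))
```

Grouping `samples` by identity gives per-identity distributions of the kind described for Fig. 7(a), while ignoring the identity index pools them as described for Fig. 7(b).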

Figure 7: Theory-driven simulation results: (a) probability density functions of the extracted deepfake feature for $K = 5$ identities. Features of different identities can have different distributions, as reflected by different $\mu_0^{(k)}$ with prior mean $u_0$ and different $\mu_1^{(k)}$ with prior mean $u_1$ for each identity $k$; (b) combined probability density function of the extracted deepfake feature. If the identity information is not considered, the individual distributions mix into a single distribution; and (c) deepfake detection performance with and without knowledge of the identity. Detection performance is better when the identity information is known. A larger gain can be achieved for more unique individualized deepfake traces (larger $\sigma_\mu$) and more difficult detection problems (smaller $|u_1 - u_0|$).

The Bayes risk \citeSvan2004detection for an arbitrary rejection region $S_1$ is defined as

$$ r(S_1) = C_{10}\,\mathbb{P}(S_1 | H_0)\,\mathbb{P}(H_0) + C_{01}\,\mathbb{P}(S_0 | H_1)\,\mathbb{P}(H_1), \tag{8} $$

where $\mathbb{P}(\cdot)$ is the probability measure, $C_{ij}$ is the cost incurred by choosing $H_i$ when $H_j$ is true, and $\mathbb{P}(H_i)$ is the prior. To focus on the effect of identity conditioning, we assume that the dataset $S$ is balanced, i.e., $\mathbb{P}(H_0) = \mathbb{P}(H_1) = 0.5$, and that the incurred costs are the same, i.e., $C_{01} = C_{10} = 1$. With these assumptions, the Bayes risk reduces to the overall error probability $P_{\text{e}}$.
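
To make the reduction concrete, the following sketch evaluates (8) numerically. The function and argument names are hypothetical, and the false-alarm and miss probabilities are arbitrary example values.

```python
def bayes_risk(
    p_false_alarm: float,
    p_miss: float,
    prior_h0: float = 0.5,
    c10: float = 1.0,
    c01: float = 1.0,
) -> float:
    """Evaluate r(S1) in (8).

    p_false_alarm is P(S1 | H0) and p_miss is P(S0 | H1).
    The defaults encode the balanced, equal-cost assumptions.
    """
    prior_h1 = 1.0 - prior_h0
    return c10 * p_false_alarm * prior_h0 + c01 * p_miss * prior_h1


# Under the default assumptions the risk equals the error probability
# P_e = 0.5 * [P(S1 | H0) + P(S0 | H1)].
p_e = bayes_risk(p_false_alarm=0.2, p_miss=0.4)
print(p_e)
```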

We define $S_i^{(k)} = S_i \cap S^{(k)}$ to further segment the acceptance region $S_0$ and the rejection region $S_1$ by individuals:

\begin{align}
P_{\text{e}} &= \frac{1}{2}\big[\mathbb{P}(S_1 | H_0) + \mathbb{P}(S_0 | H_1)\big] \tag{9a} \\
&= \frac{1}{2}\Big[\mathbb{P}\big(\cup_{k=1}^{K} S_1^{(k)} | H_0\big) + \mathbb{P}\big(\cup_{k=1}^{K} S_0^{(k)} | H_1\big)\Big] \tag{9b} \\
&= \frac{1}{2}\left[\sum_{k=1}^{K} \mathbb{P}(S_1^{(k)} | H_0) + \sum_{k=1}^{K} \mathbb{P}(S_0^{(k)} | H_1)\right] \tag{9c} \\
&= \frac{1}{2}\bigg\{1 + \sum_{k=1}^{K}\left[\mathbb{P}(S_1^{(k)} | H_0) - \mathbb{P}(S_1^{(k)} | H_1)\right]\bigg\}. \tag{9d}
\end{align}

Standard hypothesis testing techniques \citeSvan2004detection allow us to derive from (9d) the optimal decision rule that minimizes the Bayes risk, or equivalently the error probability. Proceeding with the derivation, the decision rule turns out to be separable for each individual $k$ and takes the form of a likelihood ratio test, namely,

$$ S_1^{(k)} = \left\{ x > \frac{\mu_0^{(k)} + \mu_1^{(k)}}{2} = T^{(k)} \right\}, \tag{10} $$

where $T^{(k)}$ is the optimal decision threshold.
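
In code, the rule in (10) is a single comparison against the midpoint of the two class means for the identity in question. The sketch below uses hypothetical function names and assumes $\mu_1^{(k)} > \mu_0^{(k)}$, so that larger feature values point toward a deepfake.

```python
def optimal_threshold(mu0: float, mu1: float) -> float:
    """T^(k) in (10): the midpoint of the per-identity class means."""
    return 0.5 * (mu0 + mu1)


def is_deepfake(x: float, mu0: float, mu1: float) -> bool:
    """Decide H1 when the feature x falls in the rejection region S1^(k)."""
    return x > optimal_threshold(mu0, mu1)


# Hypothetical identity with mu0 = 0.2 and mu1 = 1.0, so T = 0.6.
print(is_deepfake(0.8, mu0=0.2, mu1=1.0))
print(is_deepfake(0.5, mu0=0.2, mu1=1.0))
```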

Using the optimal decision rule, one can calculate the minimal error probability following (9c):

\begin{align}
P_{\text{e}}^{\text{ind}} &= \frac{1}{2}\sum_{k=1}^{K}\Big[\mathbb{P}(S_1^{(k)} | H_0) + \mathbb{P}(S_0^{(k)} | H_1)\Big] \tag{11a} \\
&= \frac{1}{2}\sum_{k=1}^{K}\Big[\mathbb{P}(S_1^{(k)} | S^{(k)}, H_0)\,\mathbb{P}(S^{(k)} | H_0) + \mathbb{P}(S_0^{(k)} | S^{(k)}, H_1)\,\mathbb{P}(S^{(k)} | H_1)\Big] \tag{11b} \\
&= \frac{1}{2K}\sum_{k=1}^{K}\Big[\mathbb{P}(S_1^{(k)} | S^{(k)}, H_0) + \mathbb{P}(S_0^{(k)} | S^{(k)}, H_1)\Big] \tag{11c} \\
&= \frac{1}{2K}\sum_{k=1}^{K}\Big[1 - \Phi\Big(\tfrac{T^{(k)} - \mu_0^{(k)}}{\sigma}\Big) + \Phi\Big(\tfrac{T^{(k)} - \mu_1^{(k)}}{\sigma}\Big)\Big] \tag{11d} \\
&= \frac{1}{K}\sum_{k=1}^{K}\Phi\left(-d_k\right). \quad\blacksquare \tag{11e}
\end{align}

Here, (11c) is due to the assumption that the identities are uniformly distributed over the dataset, i.e., $\mathbb{P}(S^{(k)} | H_0) = \mathbb{P}(S^{(k)} | H_1) = 1/K$, $\Phi$ is the cumulative distribution function (CDF) of the standard Gaussian, and $d_k = \big(\mu_1^{(k)} - \mu_0^{(k)}\big) / (2\sigma)$.
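
As a numerical sanity check, the sketch below evaluates both (11d) and (11e) for a small set of identities and confirms that they agree. The function names and the example means are hypothetical; $\Phi$ is computed from the error function in Python's standard library.

```python
import math
from typing import List, Tuple


def std_normal_cdf(z: float) -> float:
    """Phi(z), the CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pe_individual_11d(means: List[Tuple[float, float]], sigma: float) -> float:
    """Evaluate (11d) with the per-identity threshold T^(k) from (10)."""
    total = 0.0
    for mu0, mu1 in means:
        t_k = 0.5 * (mu0 + mu1)
        total += 1.0 - std_normal_cdf((t_k - mu0) / sigma)
        total += std_normal_cdf((t_k - mu1) / sigma)
    return total / (2.0 * len(means))


def pe_individual_11e(means: List[Tuple[float, float]], sigma: float) -> float:
    """Evaluate (11e): the average of Phi(-d_k), d_k = (mu1 - mu0) / (2 sigma)."""
    return sum(
        std_normal_cdf(-(mu1 - mu0) / (2.0 * sigma)) for mu0, mu1 in means
    ) / len(means)


# Hypothetical per-identity means (mu0^(k), mu1^(k)) with sigma = 1.
example_means = [(0.0, 1.0), (-0.3, 1.2), (0.4, 0.9)]
pe_d = pe_individual_11d(example_means, sigma=1.0)
pe_e = pe_individual_11e(example_means, sigma=1.0)
print(pe_d, pe_e)
```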

In contrast, when there is no information about the identity, the hypothesis testing problem is reduced to the basic form as shown in Fig. 7(b). One can prove the following identity-agnostic optimal decision rule:

$$ S_1^{(k)} = \left\{ x > \frac{u_0 + u_1}{2} = T \right\}, \ \forall k. \tag{12} $$

The minimal error probability $P_{\text{e}}^{\text{com}}$ with all identities mixed is then given by:

\begin{align}
P_{\text{e}}^{\text{com}} &= \frac{1}{2}\sum_{k=1}^{K}\Big[\mathbb{P}(S_1^{(k)} | H_0) + \mathbb{P}(S_0^{(k)} | H_1)\Big] \tag{13a} \\
&= \frac{1}{2K}\sum_{k=1}^{K}\Big[1 - \Phi\big(\tfrac{T - \mu_0^{(k)}}{\sigma}\big) + \Phi\big(\tfrac{T - \mu_1^{(k)}}{\sigma}\big)\Big]. \tag{13b}
\end{align}
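
The effect of replacing the per-identity threshold with the shared one can be checked numerically. The sketch below evaluates (13b) with $T$ from (12) and, for comparison, the same expression with each identity's own midpoint threshold. The helper names and the example means are hypothetical, and every example identity satisfies $\mu_1^{(k)} > \mu_0^{(k)}$.

```python
import math
from typing import List, Sequence, Tuple


def std_normal_cdf(z: float) -> float:
    """Phi(z), the CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def error_probability(
    means: Sequence[Tuple[float, float]], sigma: float, thresholds: Sequence[float]
) -> float:
    """Average over identities of 0.5 * [P(x > T | H0) + P(x <= T | H1)]."""
    total = 0.0
    for (mu0, mu1), t in zip(means, thresholds):
        p_false_alarm = 1.0 - std_normal_cdf((t - mu0) / sigma)
        p_miss = std_normal_cdf((t - mu1) / sigma)
        total += 0.5 * (p_false_alarm + p_miss)
    return total / len(means)


u0, u1, sigma = 0.0, 1.0, 1.0
means: List[Tuple[float, float]] = [(-0.4, 0.7), (0.3, 1.5), (0.1, 0.8)]

shared_thresholds = [0.5 * (u0 + u1)] * len(means)           # T in (12)
own_thresholds = [0.5 * (mu0 + mu1) for mu0, mu1 in means]   # per-identity midpoints

pe_com = error_probability(means, sigma, shared_thresholds)
pe_ind = error_probability(means, sigma, own_thresholds)
print(pe_ind, pe_com)
```

For these example values the shared threshold gives the larger error probability, since each identity's midpoint differs from $T$.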

Plugging in $T$ and using the second-order Taylor expansion of $\Phi(\cdot)$ around $d_k$, we obtain

$$ P_{\text{e}}^{\text{com}} \approx P_{\text{e}}^{\text{ind}} + \frac{1}{2K}\sum_{k=1}^{K}\left[-\Phi''\left(d_k\right)\right]\alpha_k^2. \quad\blacksquare \tag{14} $$

Here, $\alpha_k = \big[(u_0 - \mu_0^{(k)}) + (u_1 - \mu_1^{(k)})\big] / (2\sigma)$, $\Phi''(\cdot)$ is the second-order derivative of $\Phi$, and $-\Phi''(d_k) > 0$ because $-\Phi''(d_k) = d_k\,\Phi'(d_k)$ is positive whenever $d_k > 0$. This reveals that $P_{\text{e}}^{\text{com}}$ is larger (worse) than $P_{\text{e}}^{\text{ind}}$, highlighting the significance of identity conditioning for detection.
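
The intermediate steps between (13b) and (14) can be written out as follows. This is a reconstruction from the definitions of $T$, $T^{(k)}$, $d_k$, and $\alpha_k$ given in the text, offered as a sketch of one way to carry out the expansion rather than as the authors' own derivation.

```latex
% Step 1: rewrite the arguments of \Phi in (13b) using d_k and \alpha_k.
\begin{align*}
\frac{T - \mu_0^{(k)}}{\sigma} &= d_k + \alpha_k, &
\frac{T - \mu_1^{(k)}}{\sigma} &= -d_k + \alpha_k .
\end{align*}
% Step 2: expand each term to second order in \alpha_k around d_k,
% using \Phi(-z) = 1 - \Phi(z).
\begin{align*}
1 - \Phi(d_k + \alpha_k)
  &\approx 1 - \Phi(d_k) - \Phi'(d_k)\,\alpha_k - \tfrac{1}{2}\Phi''(d_k)\,\alpha_k^2, \\
\Phi(-d_k + \alpha_k) = 1 - \Phi(d_k - \alpha_k)
  &\approx 1 - \Phi(d_k) + \Phi'(d_k)\,\alpha_k - \tfrac{1}{2}\Phi''(d_k)\,\alpha_k^2 .
\end{align*}
% Step 3: add the two expansions; the first-order terms cancel.
\begin{align*}
1 - \Phi(d_k + \alpha_k) + \Phi(-d_k + \alpha_k)
  &\approx 2\,\Phi(-d_k) + \big[-\Phi''(d_k)\big]\alpha_k^2 .
\end{align*}
% Step 4: average over k with the factor 1/(2K) from (13b) and use (11e).
\begin{align*}
P_{\text{e}}^{\text{com}}
  &\approx \frac{1}{K}\sum_{k=1}^{K}\Phi(-d_k)
   + \frac{1}{2K}\sum_{k=1}^{K}\big[-\Phi''(d_k)\big]\alpha_k^2
   = P_{\text{e}}^{\text{ind}}
   + \frac{1}{2K}\sum_{k=1}^{K}\big[-\Phi''(d_k)\big]\alpha_k^2 .
\end{align*}
```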

Fig. 7(c) shows $P_{\text{e}}^{\text{ind}}$ and $P_{\text{e}}^{\text{com}}$ obtained from a large number of simulation iterations for $u_1 - u_0 \in \{0.5\sigma,\, 1.0\sigma,\, 1.5\sigma\}$. The performance improves when the individual distributions are used by the detector, and this effect is amplified with a larger $\sigma_\mu$ [i.e., more unique individualized deepfake traces; larger $|\alpha_k|$ in (14)] and with a smaller $|u_1 - u_0|$ [i.e., more intrinsically difficult detection problems; smaller $d_k$ in (14) for $\Phi''(\cdot)$'s monotonically increasing interval on the positive half of the axis]. We used $K = 5$ identities for this simulation and verified via simulation that the performance is not sensitive to the choice of $K$.
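
A simulation of this kind can be sketched in a few lines. The code below repeatedly draws identity means from the priors in (7) and averages the closed-form error probabilities (11d) and (13b) over the draws. The function name, the number of trials, the seed, and the grid of $\sigma_\mu$ values are assumptions for illustration; they are not the settings used to produce Fig. 7(c).

```python
import math
import random
from typing import Tuple


def std_normal_cdf(z: float) -> float:
    """Phi(z), the CDF of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_error_probabilities(
    delta: float,
    sigma_mu: float,
    sigma: float = 1.0,
    num_identities: int = 5,
    num_trials: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return Monte Carlo averages of (P_e^ind, P_e^com) for u1 - u0 = delta."""
    rng = random.Random(seed)
    u0, u1 = 0.0, delta
    shared_t = 0.5 * (u0 + u1)  # identity-agnostic threshold T in (12)
    pe_ind = 0.0
    pe_com = 0.0
    for _ in range(num_trials):
        for _ in range(num_identities):
            mu0 = rng.gauss(u0, sigma_mu)  # prior (7a)
            mu1 = rng.gauss(u1, sigma_mu)  # prior (7b)
            own_t = 0.5 * (mu0 + mu1)      # per-identity threshold T^(k) in (10)
            pe_ind += 0.5 * (
                1.0
                - std_normal_cdf((own_t - mu0) / sigma)
                + std_normal_cdf((own_t - mu1) / sigma)
            )
            pe_com += 0.5 * (
                1.0
                - std_normal_cdf((shared_t - mu0) / sigma)
                + std_normal_cdf((shared_t - mu1) / sigma)
            )
    count = num_trials * num_identities
    return pe_ind / count, pe_com / count


for delta in (0.5, 1.0, 1.5):
    for sigma_mu in (0.2, 1.0):
        ind, com = simulate_error_probabilities(delta, sigma_mu)
        print(f"delta={delta} sigma_mu={sigma_mu} Pe_ind={ind:.4f} Pe_com={com:.4f}")
```

Using the closed-form expressions inside the loop avoids sampling individual features, so the only Monte Carlo noise comes from the draws of the identity means.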

4 Fine-Grained Performance Analysis Over Identities

The detection performance for an overall population of unknown composition may not be the most relevant metric for a journalist who is targeting a specific celebrity or politician. The individualized deepfake detection proposed in this work allows more tailored optimization on an individual basis. The performance of the proposed individualized deepfake detector and two baseline methods for every public figure is shown in Fig. 8. The figure reveals that the performance of the baseline methods is less consistent across identities. For some identities, the performance of the baseline methods is significantly worse than their own average performance. This underscores the greater reliability and consistency of the proposed method in deepfake detection of public figures.

Figure 8: The performance of the deepfake detectors, measured in $1 - \text{AUC}$ (the smaller, the better), varies with the identities. The red and cyan peaks reveal that the baseline methods, which do not utilize identity information, are less likely to perform well for specific individuals.