Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics

Yuezun Li¹, Xin Yang¹, Pu Sun², Honggang Qi² and Siwei Lyu¹
¹ University at Albany, State University of New York, USA
² University of Chinese Academy of Sciences, China
arXiv:1909.12962v4 [cs.CR] 16 Mar 2020

Abstract

AI-synthesized face-swapping videos, commonly known as DeepFakes, are an emerging problem threatening the trustworthiness of online information. The need to develop and evaluate DeepFake detection algorithms calls for large-scale datasets. However, current DeepFake datasets suffer from low visual quality and do not resemble DeepFake videos circulated on the Internet. We present a new large-scale challenging DeepFake video dataset, Celeb-DF, which contains 5,639 high-quality DeepFake videos of celebrities generated using an improved synthesis process. We conduct a comprehensive evaluation of DeepFake detection methods and datasets to demonstrate the escalated level of challenges posed by Celeb-DF.

1. Introduction

A recent twist to the disconcerting problem of online disinformation is falsified videos created by AI technologies, in particular, deep neural networks (DNNs). Although fabrication and manipulation of digital images and videos are not new [16], the use of DNNs has made the process of creating convincing fake videos increasingly easy and fast.

One particular type of DNN-based fake videos, commonly known as DeepFakes, has recently drawn much attention. In a DeepFake video, the faces of a target individual are replaced by the faces of a donor individual synthesized by DNN models, retaining the target’s facial expressions and head poses. Since faces are intrinsically associated with identity, well-crafted DeepFakes can create illusions of a person’s presence and activities that do not occur in reality, which can lead to serious political, social, financial, and legal consequences [11].

With the escalated concerns over DeepFakes, there has recently been a surge of interest in developing DeepFake detection methods [6, 17, 27, 53, 33, 28, 41, 40, 35, 34, 36], with an upcoming dedicated global DeepFake Detection Challenge¹. The availability of large-scale datasets of DeepFake videos is an enabling factor in the development of DeepFake detection methods. To date, we have the UADFV dataset [53], the DeepFake-TIMIT dataset (DF-TIMIT) [25], the FaceForensics++ dataset (FF-DF) [40]², the Google DeepFake detection dataset (DFD) [15], and the Facebook DeepFake detection challenge (DFDC) dataset [14].

¹ https://deepfakedetectionchallenge.ai.
² FaceForensics++ contains other types of fake videos. We consider only the DeepFake videos.

However, a closer look at the DeepFake videos in existing datasets reveals stark contrasts in visual quality to the actual DeepFake videos circulated on the Internet. Several common visual artifacts that can be found in these datasets are highlighted in Fig.1, including low-quality synthesized faces, visible splicing boundaries, color mismatch, visible parts of the original face, and inconsistent synthesized face orientations. These artifacts are likely the result of imperfect steps of the synthesis method and the lack of curation of the synthesized videos before they are included in the datasets. Moreover, DeepFake videos with such low visual quality can hardly be convincing, and are unlikely to have real impact. Correspondingly, high detection performance on these datasets may not bear strong relevance when the detection methods are deployed in the wild.

In this work, we present a new large-scale and challenging DeepFake video dataset, Celeb-DF³, for the development and evaluation of DeepFake detection algorithms. There are in total 5,639 DeepFake videos, corresponding to more than 2 million frames, in the Celeb-DF dataset. The real source videos are based on publicly available YouTube video clips of 59 celebrities of diverse genders, ages, and ethnic groups. The DeepFake videos are generated using an improved DeepFake synthesis method. As a result, the overall visual quality of the synthesized DeepFake videos in Celeb-DF is greatly improved when compared to existing datasets, with significantly fewer notable visual artifacts, see Fig.2. Based on the Celeb-DF dataset and other existing datasets, we conduct an evaluation of current DeepFake detection methods. This is the most comprehensive performance evaluation of DeepFake detection methods to date. The results show that Celeb-DF is challenging to most of the existing detection methods, even though many DeepFake detection methods are shown to achieve high, sometimes near perfect, accuracy on previous datasets.

³ http://www.cs.albany.edu/~lsw/celeb-deepfakeforensics.html.

Figure 1. Visual artifacts of DeepFake videos in existing datasets. Note some common types of visual artifacts in these video frames, including low-quality synthesized faces (row 1 col 1, row 3 col 2, row 5 col 3), visible splicing boundaries (row 3 col 1, row 4 col 2, row 5 col 2), color mismatch (row 5 col 1), visible parts of the original face (row 1 col 1, row 2 col 1, row 4 col 3), and inconsistent synthesized face orientations (row 3 col 3). Row labels: UADFV, DF-TIMIT-HQ, FF-DF, DFD, DFDC. This figure is best viewed in color.

2. Backgrounds

2.1. DeepFake Video Generation

Although in recent years there have been many sophisticated algorithms for generating realistic synthetic face videos [9, 13, 46, 51, 26, 47, 37, 20, 23, 10, 21, 50], most of these have not become mainstream open-source software tools that anyone can use. Instead, a much simpler method based on the work of neural image style transfer [29] has become the tool of choice to create DeepFake videos at scale, with several independent open-source implementations, e.g., FakeApp [5], DFaker [2], faceswap-GAN [3], faceswap [4], and DeepFaceLab [1]. We refer to this method as the basic DeepFake maker, and it underlies many DeepFake videos circulated on the Internet or in the existing datasets.

The overall pipeline of the basic DeepFake maker is shown in Fig.3 (left). From an input video, faces of the target are detected, from which facial landmarks are further extracted. The landmarks are used to align the faces to a standard configuration [22]. The aligned faces are then cropped and fed to an auto-encoder [24] to synthesize faces of the donor with the same facial expressions as the original target’s faces.

The auto-encoder is usually formed by two convolutional neural networks (CNNs), i.e., the encoder and the decoder. The encoder E converts the input target’s face to a vector known as the code. To ensure that the encoder captures identity-independent attributes such as facial expressions, there is one single encoder regardless of the identities of the subjects. On the other hand, each identity has a dedicated decoder D_i, which generates a face of the corresponding subject from the code. The encoder and decoders are trained in tandem using uncorresponded face sets of multiple subjects in an unsupervised manner, Fig.3 (right). Specifically, an encoder-decoder pair is formed alternately using E and D_i for the input faces of each subject, and their parameters are optimized to minimize the reconstruction errors (ℓ1 difference between the input and reconstructed faces). The parameter update is performed with back-propagation until convergence.

The synthesized faces are then warped back to the configuration of the original target’s faces and trimmed with a mask from the facial landmarks. The last step involves smoothing the boundaries between the synthesized regions and the original video frames. The whole process is automatic and runs with little manual intervention.

2.2. DeepFake Detection Methods

Since DeepFakes became a global phenomenon, there has been an increasing interest in DeepFake detection methods. Most of the current DeepFake detection methods use data-driven deep neural networks (DNNs) as backbone.
Figure 2. Example frames from the Celeb-DF dataset. The left column shows frames from real videos and the right five columns show the corresponding DeepFake frames generated using different donor subjects.

[Figure 3 diagram. Labeled elements: face detection, landmark extraction, face alignment, shared encoder (E), code, decoders (D1, D2), L1 loss, affine warping, boundary smooth masking.]

Figure 3. Synthesis (left) and training (right) of the basic DeepFake maker algorithm. See texts for more details.
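As a minimal sketch of the training scheme in Fig.3 (right), the toy example below pairs one shared encoder with a dedicated decoder per identity and alternates between identities while minimizing the ℓ1 reconstruction error. Everything concrete here is an assumption made for illustration: faces are short lists of numbers rather than images, the encoder and decoders are per-element linear maps rather than CNNs, and the names SharedEncoder, Decoder, train and swap are hypothetical. It is not the implementation used by any of the tools cited in the text.

```python
# Toy stand-in for the shared-encoder / per-identity-decoder auto-encoder.
# Assumption: a "face" is a flat list of floats in [0, 1].

def l1_loss(pred, target):
    """Mean absolute difference between reconstruction and input."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

class SharedEncoder:
    """One encoder E for all identities (here: a per-element scale)."""
    def __init__(self, dim):
        self.w = [1.0] * dim
    def encode(self, face):
        return [w * x for w, x in zip(self.w, face)]

class Decoder:
    """One decoder D_i per identity (here: per-element scale and offset)."""
    def __init__(self, dim):
        self.a = [0.5] * dim
        self.b = [0.0] * dim
    def decode(self, code):
        return [a * c + b for a, c, b in zip(self.a, code, self.b)]

def train_step(encoder, decoder, face, lr):
    """One subgradient step on the L1 reconstruction error of one face."""
    code = encoder.encode(face)
    recon = decoder.decode(code)
    for k, x in enumerate(face):
        diff = recon[k] - x
        sign = (diff > 0) - (diff < 0)      # subgradient of |diff|
        grad_a = sign * code[k]
        grad_b = sign
        grad_w = sign * decoder.a[k] * x    # chain rule through the decoder
        decoder.a[k] -= lr * grad_a
        decoder.b[k] -= lr * grad_b
        encoder.w[k] -= lr * grad_w

def mean_loss(encoder, decoders, faces):
    losses = [l1_loss(decoders[i].decode(encoder.encode(f)), f)
              for i, face_set in faces.items() for f in face_set]
    return sum(losses) / len(losses)

def train(faces, epochs=200, lr=0.01):
    dim = len(next(iter(faces.values()))[0])
    encoder = SharedEncoder(dim)
    decoders = {identity: Decoder(dim) for identity in faces}
    for _ in range(epochs):
        # Alternate identities: E is always updated, D_i only for subject i.
        for identity, face_set in faces.items():
            for face in face_set:
                train_step(encoder, decoders[identity], face, lr)
    return encoder, decoders

def swap(encoder, decoders, target_face, donor):
    """Synthesis: encode the target's face, decode with the donor's decoder."""
    return decoders[donor].decode(encoder.encode(target_face))

# Made-up data: two identities with two "faces" each.
faces = {
    "A": [[0.2, 0.4, 0.6, 0.8], [0.3, 0.5, 0.7, 0.9]],
    "B": [[0.9, 0.7, 0.5, 0.3], [0.8, 0.6, 0.4, 0.2]],
}
initial_loss = mean_loss(SharedEncoder(4), {i: Decoder(4) for i in faces}, faces)
encoder, decoders = train(faces)
final_loss = mean_loss(encoder, decoders, faces)
swapped = swap(encoder, decoders, faces["A"][0], donor="B")
```

The point of the design is that the encoder sees faces of every identity while each decoder sees only its own, so swapping decoders at synthesis time renders the target's code with the donor's appearance.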
Since synthesized faces are spliced into the original video frames, state-of-the-art DNN splicing detection methods, e.g., [54, 55, 30, 8], can be applied. There have also been algorithms dedicated to the detection of DeepFake videos that fall into three categories. Methods in the first category are based on inconsistencies exhibited in the physical/physiological aspects in the DeepFake videos. The method in [27] exploits the observation that many DeepFake videos lack reasonable eye blinking due to the use of online portraits as training data, which usually do not have closed eyes for aesthetic reasons. Incoherent head poses in DeepFake videos are utilized in [53] to
expose DeepFake videos. In [7], the idiosyncratic behavioral patterns of a particular individual are captured by the time series of facial landmarks extracted from real videos and used to spot DeepFake videos. The second category of DeepFake detection algorithms (e.g., [33, 28]) uses signal-level artifacts introduced during the synthesis process such as those described in the Introduction. The third category of DeepFake detection methods (e.g., [6, 17, 35, 36]) is data-driven: these methods directly employ various types of DNNs trained on real and DeepFake videos, not relying on any specific artifact.

2.3. Existing DeepFake Datasets

DeepFake detection methods require training data and need to be evaluated. As such, there is an increasing need for large-scale DeepFake video datasets. Table 1 lists the current DeepFake datasets.

UADFV: The UADFV dataset [53] contains 49 real YouTube and 49 DeepFake videos. The DeepFake videos are generated using the DNN model with FakeAPP [5].

DF-TIMIT: The DeepFake-TIMIT dataset [25] includes 640 DeepFake videos generated with faceswap-GAN [3] and based on the Vid-TIMIT dataset [43]. The videos are divided into two equal-sized subsets: DF-TIMIT-LQ and DF-TIMIT-HQ, with synthesized faces of size 64 × 64 and 128 × 128 pixels, respectively.

FF-DF: The FaceForensics++ dataset [40] includes a subset of DeepFake videos, which has 1,000 real YouTube videos and the same number of synthetic videos generated using faceswap [4].

DFD: The Google/Jigsaw DeepFake detection dataset [15] has 3,068 DeepFake videos generated based on 363 original videos of 28 consented individuals of various genders, ages and ethnic groups. The details of the synthesis algorithm are not disclosed, but it is likely to be an improved implementation of the basic DeepFake maker algorithm.

DFDC: The Facebook DeepFake detection challenge dataset [14] is part of the DeepFake detection challenge, which has 4,113 DeepFake videos created based on 1,131 original videos of 66 consented individuals of various genders, ages and ethnic groups⁴. This dataset is created using two different synthesis algorithms, but the details of the synthesis algorithm are not disclosed.

⁴ The full set of DFDC has not been released at the time of CVPR submission, and information is based on the first round release in [14].

Based on release time and synthesis algorithms, we categorize UADFV, DF-TIMIT, and FF-DF as the first generation of DeepFake datasets, while DFD, DFDC, and the proposed Celeb-DF datasets are the second generation. In general, the second generation datasets improve in both quantity and quality over the first generation.

Dataset        # Real Video   # Real Frame   # DeepFake Video   # DeepFake Frame   Release Date
UADFV          49             17.3k          49                 17.3k              2018.11
DF-TIMIT-LQ    320∗           34.0k          320                34.0k              2018.12
DF-TIMIT-HQ    320∗           34.0k          320                34.0k              2018.12
FF-DF          1,000          509.9k         1,000              509.9k             2019.01
DFD            363            315.4k         3,068              2,242.7k           2019.09
DFDC           1,131          488.4k         4,113              1,783.3k           2019.10
Celeb-DF       590            225.4k         5,639              2,116.8k           2019.11
Table 1. Basic information of various DeepFake video datasets. ∗: the original videos in DF-TIMIT are from the Vid-TIMIT dataset; the same 320 real videos are shared by the LQ and HQ subsets.

3. The Celeb-DF Dataset

Although the current DeepFake datasets have a sufficient number of videos, as discussed in the Introduction and demonstrated in Fig.1, DeepFake videos in these datasets have various visual artifacts that easily distinguish them from the real videos. To provide more relevant data to evaluate and support the future development of DeepFake detection methods, we construct the Celeb-DF dataset. A comparison of the Celeb-DF dataset with other existing DeepFake datasets is summarized in Table 1.

3.1. Basic Information

The Celeb-DF dataset is comprised of 590 real videos and 5,639 DeepFake videos (corresponding to over two million video frames). The average length of all videos is approximately 13 seconds with the standard frame rate of 30 frames per second. The real videos are chosen from publicly available YouTube videos, corresponding to interviews of 59 celebrities with a diverse distribution in their genders, ages, and ethnic groups⁵. 56.8% of the subjects in the real videos are male, and 43.2% are female. 8.5% are of age 60 and above, 30.5% are between 50 and 60, 26.6% are in their 40s, 28.0% are in their 30s, and 6.4% are younger than 30. 5.1% are Asians, 6.8% are African Americans and 88.1% are Caucasians. In addition, the real videos exhibit a large range of changes in aspects such as the subjects’ face sizes (in pixels), orientations, lighting conditions, and backgrounds. The DeepFake videos are generated by swapping faces for each pair of the 59 subjects. The final videos are in MPEG4.0 format.

⁵ We choose celebrities’ faces as they are more familiar to the viewers so that any visual artifacts can be more readily identified. Furthermore, celebrities are anecdotally the main targets of DeepFake videos.

3.2. Synthesis Method

The DeepFake videos in Celeb-DF are generated using an improved DeepFake synthesis algorithm, which is key to the improved visual quality as shown in Fig.2. Specifically, the basic DeepFake maker algorithm is refined in several aspects targeting the following specific visual artifacts observed in existing datasets.

Low resolution of synthesized faces: The basic DeepFake maker algorithm generates low-resolution faces (typically 64 × 64 or 128 × 128 pixels). We improve the resolution of
the synthesized face to 256 × 256 pixels. This is achieved by using encoder and decoder models with more layers and increased dimensions. We determine the structure empirically for a balance between increased training time and better synthesis result. The higher-resolution synthesized faces are of better visual quality and less affected by resizing and rotation operations in accommodating the input target faces, Fig.4.

Figure 4. Comparison of DeepFake frames with different sizes of the synthesized faces (64 × 64, 128 × 128 and 256 × 256). Note the improved smoothness of the 256 × 256 synthesized face, which is used in Celeb-DF. This figure is best viewed in color.

Color mismatch: Color mismatch between the synthesized donor’s face and the original target’s face in Celeb-DF is significantly reduced by training data augmentation and post processing. Specifically, in each training epoch, we randomly perturb the colors of the training faces, which forces the DNNs to synthesize an image containing the same color pattern as the input image. We also apply a color transfer algorithm [38] between the synthesized donor face and the input target face. Fig.5 shows an example of a synthesized face without (left) and with (right) color correction.

Figure 5. DeepFake frames using synthesized face without (left) and with (right) color correction. Note the reduced color mismatch between the synthesized face region and the other part of the face. Synthesis method with color correction is used to generate Celeb-DF. This figure is best viewed in color.

Inaccurate face masks: In previous datasets, the face masks are either rectangular, which may not completely cover the facial parts in the original video frame, or the convex hull of landmarks on eyebrow and lower lip, which leaves the boundaries of the mask visible. We improve the mask generation step for Celeb-DF. We first synthesize a face with more surrounding context, so as to completely cover the original facial parts after warping. We then create a smoothness mask based on the landmarks on eyebrow and interpolated points on cheeks and between lower lip and chin. The difference in mask generation used in existing datasets and Celeb-DF is highlighted in Fig.6 with an example.

Figure 6. Mask generation in existing datasets (top two rows) and Celeb-DF (3rd row). (a) warped synthesized face overlaying the target’s face. (b) mask generation. (c) final synthesis result. Annotations in the figure: visual parts of original face, boundary artifacts, facial landmarks, interpolated points.

Temporal flickering: We reduce temporal flickering of synthetic faces in the DeepFake videos by incorporating temporal correlations among the detected face landmarks. Specifically, the temporal sequence of the face landmarks is filtered using a Kalman smoothing algorithm to reduce imprecise variations of landmarks in each frame.

3.3. Visual Quality

The refinements to the synthesis algorithm improve the visual quality of the DeepFake videos in the Celeb-DF dataset, as demonstrated in Fig.2. We would like to have a more quantitative evaluation of the improvement in visual quality of the DeepFake videos in Celeb-DF and compare with the previous DeepFake datasets. Ideally, a reference-free face image quality metric is the best choice for this purpose. Unfortunately, to date there is no such metric that is agreed upon and widely adopted.

Instead, we follow the face in-painting work [45] and use the Mask-SSIM score [32] as a referenced quantitative metric of visual quality of synthesized DeepFake video frames. Mask-SSIM corresponds to the SSIM score [52] between the head regions (including face and hair) of the DeepFake video frame and the corresponding original video frame, i.e., the head region of the original target is the reference for visual quality evaluation. As such, a low Mask-SSIM score may be due to inferior visual quality as well as changes of the identity from the target to the donor. On the other hand,
since we only compare frames from DeepFake videos, the errors caused by identity changes are biased in a similar fashion to all compared datasets. Therefore, the numerical values of Mask-SSIM may not be meaningful to evaluate the absolute visual quality of the synthesized faces, but the difference between Mask-SSIM reflects the difference in visual quality.

The Mask-SSIM score takes value in the range of [0, 1] with higher value corresponding to better image quality. Table 2 shows the average Mask-SSIM scores for all compared datasets, with Celeb-DF having the highest scores. This confirms the visual observation that Celeb-DF has improved visual quality, as shown in Fig.2.

Datasets     UADFV   DF-TIMIT-LQ   DF-TIMIT-HQ   FF-DF   DFD    DFDC   Celeb-DF
Mask-SSIM    0.82    0.80          0.80          0.81    0.88   0.84   0.92
Table 2. Average Mask-SSIM scores of different DeepFake datasets. Computing Mask-SSIM requires exact corresponding pairs of DeepFake synthesized frames and original video frames, which is not the case for DFD and DFDC. For these two datasets, we calculate the Mask-SSIM on videos for which we have exact correspondences, i.e., 311 videos in DFD and 2,025 videos in DFDC.

4. Evaluating DeepFake Detection Methods

Using Celeb-DF and other existing DeepFake datasets, we perform the most comprehensive performance evaluation of DeepFake detection to date, with the largest number of DeepFake detection methods and datasets considered. There are two purposes of this evaluation. First, using the average detection performance as an indicator of the challenge levels of various DeepFake datasets, we further compare Celeb-DF with existing DeepFake datasets. Furthermore, we survey the performance of the current DeepFake detection methods on a large diversity of DeepFake videos, in particular, the high-quality ones in Celeb-DF.

4.1. Compared DeepFake Detection Methods

We consider nine DeepFake detection methods in our experiments. Because of the need to run each method on the Celeb-DF dataset, we choose only those that have code and the corresponding DNN model publicly available or obtained from the authors directly.
• Two-stream [54] uses a two-stream CNN to achieve state-of-the-art performance in general-purpose image forgery detection. The underlying CNN is the GoogLeNet InceptionV3 model [48] trained on the SwapMe dataset [54]. We use it as a baseline to compare other dedicated DeepFake detection methods.
• MesoNet [6] is a CNN-based DeepFake detection method targeting the mesoscopic properties of images. The model is trained on unpublished DeepFake datasets collected by the authors. We evaluate two variants of MesoNet, namely, Meso4 and MesoInception4. Meso4 uses conventional convolutional layers, while MesoInception4 is based on the more sophisticated Inception modules [49].
• HeadPose [53] detects DeepFake videos using the inconsistencies in the head poses of the synthesized videos, based on an SVM model on estimated 3D head orientations from each video. The SVM model in this method is trained on the UADFV dataset.
• FWA [28] detects DeepFake videos using a ResNet-50 [19] to expose the face warping artifacts introduced by the resizing and interpolation operations in the basic DeepFake maker algorithm. This model is trained on self-collected face images.
• VA [33] is a recent DeepFake detection method based on capturing visual artifacts in the eyes, teeth and facial contours of the synthesized faces. There are two variants of this method: VA-MLP is based on a multilayer feedforward neural network classifier, and VA-LogReg uses a simpler logistic regression model. These models are trained on an unpublished dataset, whose real images are cropped from the CelebA dataset [31] and whose DeepFake videos are from YouTube.
• Xception [40] corresponds to a DeepFake detection method based on the XceptionNet model [12] trained on the FaceForensics++ dataset. There are three variants of Xception, namely, Xception-raw, Xception-c23 and Xception-c40: Xception-raw is trained on raw videos, while Xception-c23 and Xception-c40 are trained on H.264 videos with medium (23) and high (40) degrees of compression, respectively.
• Multi-task [34] is another recent DeepFake detection method that uses a CNN model to simultaneously detect manipulated images and segment manipulated areas as a multi-task learning problem. This model is trained on the FaceForensics dataset [39].
• Capsule [36] uses capsule structures [42] based on a VGG19 [44] network as the backbone architecture for DeepFake classification. This model is trained on the FaceForensics++ dataset.
• DSP-FWA is a recently further improved method based on FWA, which includes a spatial pyramid pooling (SPP) module [18] to better handle the variations in the resolutions of the original target faces. This method is trained on self-collected face images.

A concise summary of the underlying model, source code, and training datasets of the DeepFake detection methods considered in our experiments is given in Table 3.

Methods           Model Type                     Training Dataset       Repositories                                                    Release Date
Two-stream [54]   GoogLeNet InceptionV3 [48]     SwapMe [54]            Unpublished code provided by the authors                        2018.03
MesoNet [6]       Designed CNN                   Unpublished            https://github.com/DariusAf/MesoNet                             2018.09
HeadPose [53]     SVM                            UADFV [53]             https://bitbucket.org/ericyang3721/headpose_forensic/           2018.11
FWA [28]          ResNet-50 [19]                 Unpublished            https://github.com/danmohaha/CVPRW2019_Face_Artifacts           2018.11
VA-MLP [33]       Designed CNN                   Unpublished            https://github.com/FalkoMatern/Exploiting-Visual-Artifacts      2019.01
VA-LogReg [33]    Logistic Regression Model      Unpublished            https://github.com/FalkoMatern/Exploiting-Visual-Artifacts      2019.01
Xception [40]     XceptionNet [12]               FaceForensics++ [40]   https://github.com/ondyari/FaceForensics                        2019.01
Multi-task [34]   Designed CNN                   FaceForensics [39]     https://github.com/nii-yamagishilab/ClassNSeg                   2019.06
Capsule [36]      Designed CapsuleNet [42]       FaceForensics++        https://github.com/nii-yamagishilab/Capsule-Forensics-v2        2019.10
DSP-FWA           SPPNet [18]                    Unpublished            https://github.com/danmohaha/DSP-FWA                            2019.11
Table 3. Summary of compared DeepFake detection methods. See text for more details.

4.2. Experimental Settings

We evaluate the overall detection performance using the area under ROC curve (AUC) score at the frame level for all key frames. There are several reasons for this choice. First, all compared methods analyze individual frames (usually key frames of a video) and output a classification score for each frame. Using frame-level AUC thus avoids differences caused by different approaches to aggregating frame-level scores for each video. Second, using the frame-level AUC score obviates the necessity of calibrating the classification outputs of these methods across different datasets. To increase robustness to numerical imprecision, the classification scores are rounded to five digits after the decimal point, i.e., with a precision of 10^-5. As the videos are compressed, we perform evaluations only on the key frames.

We compare the performance of each detection method using the inference code and the published pre-trained models. This is because most of these methods do not have published code for training the machine learning models. As such, we could not practically re-train these models on all datasets we considered. We use the default parameters provided with each compared detection method.

4.3. Results and Analysis

In Table 4 we list individual frame-level AUC scores of all compared DeepFake detection methods over all datasets including Celeb-DF, and Fig.9 shows the frame-level ROC curves of several top detection methods on several datasets.

Comparing different datasets, in Fig.7, we show the average frame-level AUC scores of all compared detection methods on each dataset. Celeb-DF is in general the most challenging to the current detection methods, and their overall performance on Celeb-DF is lowest across all datasets. These results are consistent with the differences in visual quality. Note that many current detection methods rely on visual artifacts such as low resolution and color mismatch, which are reduced by the synthesis algorithm used for the Celeb-DF dataset. Furthermore, the difficulty level for detection is clearly higher for the second generation datasets (DFD, DFDC, and Celeb-DF, with average AUC scores lower than 70%), while some detection methods achieve near perfect detection on the first generation datasets (UADFV, DF-TIMIT, and FF-DF, with average AUC scores around 80%).

Figure 7. Average AUC performance of all detection methods on each dataset. [Bar values: UADFV 80.2, DF-TIMIT-LQ 78.0, DF-TIMIT-HQ 72.2, FF-DF 82.3, DFD 68.2, DFDC 64.7, Celeb-DF 56.9.]

In terms of individual detection methods, Fig.8 shows the comparison of the average AUC score of each detection method on all DeepFake datasets. These results show that detection has also made progress, with the most recent DSP-FWA method achieving the overall top performance (87.4%).

Figure 8. Average AUC performance of each detection method on all evaluated datasets. [Bar values: Two-stream 68.6, Meso4 75.9, MesoInception4 73.0, HeadPose 58.7, FWA 82.1, VA-MLP 63.7, VA-LogReg 69.3, Xception-raw 63.3, Xception-c23 86.4, Xception-c40 75.2, Multi-task 60.2, Capsule 69.4, DSP-FWA 87.4.]

As online videos are usually recompressed to different formats (MPEG4.0 and H.264) and in different qualities during the process of uploading and redistribution, it is also important to evaluate the robustness of detection performance with regard to video compression. Table 5 shows the average frame-level AUC scores of four state-of-the-art DeepFake detection methods on original MPEG4.0 videos, and medium (23) and high (40) degrees of H.264 compressed videos of Celeb-DF, respectively. The results show that the performance of each method is reduced as the compression degree increases. In particular, the performance of FWA and DSP-FWA degrades significantly on recompressed video, while the performance of Xception-c23 and Xception-c40 is not significantly affected. This is expected because the latter methods were trained on compressed H.264 videos such that they are more robust in this setting.
Methods↓ Datasets→   UADFV [53]   DF-TIMIT-LQ [25]   DF-TIMIT-HQ [25]   FF-DF [40]   DFD [15]   DFDC [14]   Celeb-DF
Two-stream [54]      85.1         83.5               73.5               70.1         52.8       61.4        53.8
Meso4 [6]            84.3         87.8               68.4               84.7         76.0       75.3        54.8
MesoInception4       82.1         80.4               62.7               83.0         75.9       73.2        53.6
HeadPose [53]        89.0         55.1               53.2               47.3         56.1       55.9        54.6
FWA [28]             97.4         99.9               93.2               80.1         74.3       72.7        56.9
VA-MLP [33]          70.2         61.4               62.1               66.4         69.1       61.9        55.0
VA-LogReg            54.0         77.0               77.3               78.0         77.2       66.2        55.1
Xception-raw [40]    80.4         56.7               54.0               99.7         53.9       49.9        48.2
Xception-c23         91.2         95.9               94.4               99.7         85.9       72.2        65.3
Xception-c40         83.6         75.8               70.5               95.5         65.8       69.7        65.5
Multi-task [34]      65.8         62.2               55.3               76.3         54.1       53.6        54.3
Capsule [36]         61.3         78.4               74.4               96.6         64.0       53.3        57.5
DSP-FWA              97.7         99.9               99.7               93.0         81.1       75.5        64.6
Table 4. Frame-level AUC scores (%) of various methods on compared datasets. Bold faces correspond to the top performance.
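The per-dataset and per-method averages reported in Fig.7 and Fig.8 can be recomputed from Table 4. The short sketch below does this with plain Python; the numbers are transcribed from Table 4, while the variable and function names (AUC_TABLE, dataset_average, method_average) are illustrative assumptions rather than anything defined by the authors.

```python
# Frame-level AUC scores (%) transcribed from Table 4.
DATASETS = ["UADFV", "DF-TIMIT-LQ", "DF-TIMIT-HQ", "FF-DF", "DFD", "DFDC", "Celeb-DF"]

AUC_TABLE = {
    "Two-stream":     [85.1, 83.5, 73.5, 70.1, 52.8, 61.4, 53.8],
    "Meso4":          [84.3, 87.8, 68.4, 84.7, 76.0, 75.3, 54.8],
    "MesoInception4": [82.1, 80.4, 62.7, 83.0, 75.9, 73.2, 53.6],
    "HeadPose":       [89.0, 55.1, 53.2, 47.3, 56.1, 55.9, 54.6],
    "FWA":            [97.4, 99.9, 93.2, 80.1, 74.3, 72.7, 56.9],
    "VA-MLP":         [70.2, 61.4, 62.1, 66.4, 69.1, 61.9, 55.0],
    "VA-LogReg":      [54.0, 77.0, 77.3, 78.0, 77.2, 66.2, 55.1],
    "Xception-raw":   [80.4, 56.7, 54.0, 99.7, 53.9, 49.9, 48.2],
    "Xception-c23":   [91.2, 95.9, 94.4, 99.7, 85.9, 72.2, 65.3],
    "Xception-c40":   [83.6, 75.8, 70.5, 95.5, 65.8, 69.7, 65.5],
    "Multi-task":     [65.8, 62.2, 55.3, 76.3, 54.1, 53.6, 54.3],
    "Capsule":        [61.3, 78.4, 74.4, 96.6, 64.0, 53.3, 57.5],
    "DSP-FWA":        [97.7, 99.9, 99.7, 93.0, 81.1, 75.5, 64.6],
}

def dataset_average(dataset):
    """Average AUC of all methods on one dataset (one bar of Fig.7)."""
    col = DATASETS.index(dataset)
    scores = [row[col] for row in AUC_TABLE.values()]
    return sum(scores) / len(scores)

def method_average(method):
    """Average AUC of one method over all datasets (one bar of Fig.8)."""
    row = AUC_TABLE[method]
    return sum(row) / len(row)

hardest_dataset = min(DATASETS, key=dataset_average)
best_method = max(AUC_TABLE, key=method_average)

for name in DATASETS:
    print(f"{name:12s} {dataset_average(name):5.1f}")
print("hardest dataset:", hardest_dataset)
print("best method on average:", best_method)
```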

[Figure 9 panels: FWA, Meso4, MesoInception4, Xception-c23, Xception-c40, DSP-FWA. Each panel plots ROC curves on FF-DF, DFD, DFDC and Celeb-DF, with both axes ranging from 0 to 1 and the AUC (%) of each curve given in the legend; the legend values match Table 4.]
Figure 9. ROC curves of six state-of-the-art detection methods (FWA, Meso4, MesoInception4, Xception-c23, Xception-c40 and DSP-FWA) on the four largest datasets (FF-DF, DFD, DFDC and Celeb-DF).
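To make the frame-level protocol of Section 4.2 concrete, the sketch below builds an ROC curve from per-frame scores and integrates it with the trapezoidal rule, after rounding the scores to five decimal places as the text describes. It is a plain-Python illustration under stated assumptions, not the authors' evaluation code: the label convention (1 for DeepFake frames, 0 for real frames), the function names and the example scores are all made up here.

```python
def roc_points(labels, scores, digits=5):
    """ROC curve as a list of (false positive rate, true positive rate).

    labels: 1 for a DeepFake frame, 0 for a real frame (assumed convention).
    scores: per-frame classification scores, higher means more likely fake.
    Scores are rounded to `digits` decimals; equal scores are treated as ties.
    """
    rounded = [round(s, digits) for s in scores]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(rounded, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every frame sharing this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up frame scores for illustration.
perfect = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
mixed = auc(roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
# Scores that differ only beyond the fifth decimal become a tie after rounding.
tied = auc(roc_points([1, 0], [0.5000001, 0.5000002]))
```

In the last example the two scores differ only at the seventh decimal place, so rounding turns an apparent misranking into a tie, which is one way to read the robustness argument in the text.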

                Original   c23    c40
FWA             56.9       54.6   52.2
Xception-c23    65.3       65.5   52.5
Xception-c40    65.5       65.4   59.4
DSP-FWA         64.6       57.7   47.2
Table 5. AUC performance of four top detection methods on original, medium (23) and high (40) degrees of H.264 compressed Celeb-DF respectively.

5. Conclusion

We present a new challenging large-scale dataset for the development and evaluation of DeepFake detection methods. The Celeb-DF dataset reduces the gap in visual quality between DeepFake datasets and the actual DeepFake videos circulated online. Based on the Celeb-DF dataset, we perform a comprehensive performance evaluation of current DeepFake detection methods, and show that there is still much room for improvement.

For future work, the foremost task is to enlarge the Celeb-DF dataset and improve the visual quality of the synthesized videos. This entails improving the running efficiency and model structure of the current synthesis algorithm. Furthermore, while the forgers can improve the visual quality in general, they may also adopt anti-forensic techniques, which aim to hide the traces of DeepFake synthesis on which the detection methods rely. Anticipating such counter-measures at the forgers’ disposal, we aim to incorporate anti-forensic techniques in Celeb-DF.

Acknowledgement. This material is based upon work supported by NSF under Grant No (IIS-1816227). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF.

References

[1] DeepFaceLab github. https://github.com/iperov/DeepFaceLab, Accessed Nov 4, 2019.
[2] DFaker github. https://github.com/dfaker/df, Accessed Nov 4, 2019.
[3] faceswap-GAN github. https://github.com/shaoanlu/faceswap-GAN, Accessed Nov 4, 2019.
[4] faceswap github. https://github.com/deepfakes/faceswap, Accessed Nov 4, 2019.
[5] FakeApp. https://www.malavida.com/en/soft/fakeapp/, Accessed Nov 4, 2019.
[6] Darius Afchar, Vincent Nozick, Junichi Yamagishi, and Isao Echizen. Mesonet: a compact facial video forgery detection network. In IEEE International Workshop on Information Forensics and Security (WIFS), 2018.
[7] Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. Protecting world leaders against deep fakes. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[8] Jawadul H Bappy, Cody Simons, Lakshmanan Nataraj, BS Manjunath, and Amit K Roy-Chowdhury. Hybrid lstm and encoder-decoder architecture for detection of image forgeries. IEEE Transactions on Image Processing (TIP), 2019.
[9] Dmitri Bitouk, Neeraj Kumar, Samreen Dhillon, Peter Belhumeur, and Shree K Nayar. Face swapping: automatically replacing faces in photographs. ACM Transactions on Graphics (TOG), 2008.
[10] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In ICCV, 2019.
[11] Robert Chesney and Danielle Keats Citron. Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security. 107 California Law Review (2019, Forthcoming); U of Texas Law, Public Law Research Paper No. 692; U of Maryland Legal Studies Research Paper No. 2018-21, 2018.
[12] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[13] Kevin Dale, Kalyan Sunkavalli, Micah K Johnson, Daniel Vlasic, Wojciech Matusik, and Hanspeter Pfister. Video face replacement. ACM Transactions on Graphics (TOG), 2011.
[14] Brian Dolhansky, Russ Howes, Ben Pflaum, Nicole Baram, and Cristian Canton Ferrer. The deepfake detection challenge (DFDC) preview dataset. arXiv preprint arXiv:1910.08854, 2019.
[15] Nicholas Dufour, Andrew Gully, Per Karlsson, Alexey Victor Vorbyov, Thomas Leung, Jeremiah Childs, and Christoph Bregler. Deepfakes detection dataset by google & jigsaw.
[16] Hany Farid. Digital Image Forensics. MIT Press, 2012.
[17] David Güera and Edward J Delp. Deepfake video detection using recurrent neural networks. In AVSS, 2018.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[20] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[21] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
[22] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR, 2014.
[23] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep Video Portraits. ACM Transactions on Graphics (TOG), 2018.
[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[25] Pavel Korshunov and Sébastien Marcel. Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685, 2018.
[26] Iryna Korshunova, Wenzhe Shi, Joni Dambre, and Lucas Theis. Fast face-swap using convolutional neural networks. In ICCV, 2017.
[27] Yuezun Li, Ming-Ching Chang, and Siwei Lyu. In ictu oculi: Exposing AI generated fake face videos by detecting eye blinking. In IEEE International Workshop on Information Forensics and Security (WIFS), 2018.
[28] Yuezun Li and Siwei Lyu. Exposing deepfake videos by detecting face warping artifacts. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019.
[29] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. Unsupervised image-to-image translation networks. In NeurIPS, 2017.
[30] Yaqi Liu, Qingxiao Guan, Xianfeng Zhao, and Yun Cao. Image forgery localization based on multi-scale convolutional neural networks. In ACM Workshop on Information Hiding and Multimedia Security (IHMMSec), 2018.
[31] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
[32] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NeurIPS, 2017.
[33] Falko Matern, Christian Riess, and Marc Stamminger. Exploiting visual artifacts to expose deepfakes and face manipulations. In IEEE Winter Applications of Computer Vision Workshops (WACVW), 2019.
[34] Huy H Nguyen, Fuming Fang, Junichi Yamagishi, and Isao Echizen. Multi-task learning for detecting and segmenting manipulated facial images and videos. In IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2019.
[35] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. Capsule-forensics: Using capsule networks to detect forged images and videos. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[36] Huy H Nguyen, Junichi Yamagishi, and Isao Echizen. Use of a capsule network to detect fake images and videos. arXiv preprint arXiv:1910.12467, 2019.
[37] Hai X Pham, Yuting Wang, and Vladimir Pavlovic. Generative adversarial talking head: Bringing portraits to life with a weakly supervised neural network. arXiv preprint arXiv:1803.07716, 2018.
[38] Erik Reinhard, Michael Ashikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 2001.
[39] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179, 2018.
[40] Andreas Rössler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In ICCV, 2019.
[41] Ekraam Sabir, Jiaxin Cheng, Ayush Jaiswal, Wael AbdAlmageed, Iacopo Masi, and Prem Natarajan. Recurrent-convolution approach to deepfake detection-state-of-art results on faceforensics++. arXiv preprint arXiv:1905.00582, 2019.
[42] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In NeurIPS, 2017.
[43] Conrad Sanderson and Brian C Lovell. Multi-region probabilistic histograms for robust and scalable identity inference. In International Conference on Biometrics, 2009.
[44] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[45] Qianru Sun, Liqian Ma, Seong Joon Oh, Luc Van Gool, Bernt Schiele, and Mario Fritz. Natural and effective obfuscation by head inpainting. In CVPR, 2018.
[46] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. What makes tom hanks look like tom hanks. In ICCV, 2015.
[47] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 2017.
[48] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[49] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[50] Justus Thies, Michael Zollhöfer, and Matthias Nießner. Deferred neural rendering: Image synthesis using neural textures. In SIGGRAPH, 2019.
[51] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2face: Real-time face capture and reenactment of rgb videos. In CVPR, 2016.
[52] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 2004.
[53] Xin Yang, Yuezun Li, and Siwei Lyu. Exposing deep fakes using inconsistent head poses. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
[54] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Two-stream neural networks for tampered face detection. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.
[55] Peng Zhou, Xintong Han, Vlad I Morariu, and Larry S Davis. Learning rich features for image manipulation detection. In CVPR, 2018.
