Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics
Yuezun Li¹, Xin Yang¹, Pu Sun², Honggang Qi² and Siwei Lyu¹
¹University at Albany, State University of New York, USA
²University of Chinese Academy of Sciences, China
Figure 3. Synthesis (left) and training (right) of the basic DeepFake maker algorithm: face detection, landmark extraction, and face alignment feed a shared encoder (E) and two decoders (D1, D2) trained with an L1 loss. See text for more details.
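To make the training scheme in Figure 3 concrete, the following is a minimal PyTorch sketch (not the authors' code; the layer counts and channel widths are illustrative assumptions): a single shared encoder E is trained jointly with two identity-specific decoders D1 and D2 under an L1 reconstruction loss, and at synthesis time a face of subject 1 is encoded by E but decoded with subject 2's decoder to perform the swap.

```python
# Minimal sketch of the shared-encoder / two-decoder DeepFake training scheme
# (illustrative only; the exact architecture used in practice is not public).
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1), nn.LeakyReLU(0.1))

def deconv_block(cin, cout):
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU())

# Shared encoder E and one decoder per subject (D1, D2), as in Figure 3.
encoder = nn.Sequential(conv_block(3, 64), conv_block(64, 128), conv_block(128, 256))
decoder1 = nn.Sequential(deconv_block(256, 128), deconv_block(128, 64),
                         nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())
decoder2 = nn.Sequential(deconv_block(256, 128), deconv_block(128, 64),
                         nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

l1 = nn.L1Loss()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder1.parameters())
                       + list(decoder2.parameters()), lr=1e-4)

def train_step(faces1, faces2):
    # Each decoder learns to reconstruct its own subject from the shared code.
    loss = l1(decoder1(encoder(faces1)), faces1) + l1(decoder2(encoder(faces2)), faces2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def swap(face1):
    # Synthesis: encode subject 1's aligned face, decode with subject 2's decoder.
    with torch.no_grad():
        return decoder2(encoder(face1))
```

With 64 × 64 aligned face crops as input, this pairing of the shared encoder with the "wrong" decoder is what produces the swapped face that is then warped back into the original frame.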
Since synthesized faces are spliced into the original video frames, state-of-the-art DNN splicing detection methods, e.g., [54, 55, 30, 8], can be applied. There have also been algorithms dedicated to the detection of DeepFake videos, which fall into three categories. Methods in the first category are based on inconsistencies in the physical/physiological aspects of DeepFake videos. The method of [27] exploits the observation that many DeepFake videos lack reasonable eye blinking, because the online portraits used as training data usually do not contain closed eyes, for aesthetic reasons. Incoherent head poses are utilized in [53] to expose DeepFake videos. In [7], the idiosyncratic behavioral patterns of a particular individual, captured by time series of facial landmarks extracted from real videos, are used to spot DeepFake videos. The second category of DeepFake detection algorithms (e.g., [33, 28]) uses signal-level artifacts introduced during the synthesis process, such as those described in the Introduction. The third category of DeepFake detection methods (e.g., [6, 17, 35, 36]) is data-driven: these methods directly employ various types of DNNs trained on real and DeepFake videos, without relying on any specific artifact.

2.3. Existing DeepFake Datasets

DeepFake detection methods require training data and need to be evaluated. As such, there is an increasing need for large-scale DeepFake video datasets. Table 1 lists the current DeepFake datasets.

Dataset | # Real Videos | # Real Frames | # DeepFake Videos | # DeepFake Frames | Release Date
UADFV | 49 | 17.3k | 49 | 17.3k | 2018.11
DF-TIMIT-LQ* | 320 | 34.0k | 320 | 34.0k | 2018.12
DF-TIMIT-HQ* | 320 | 34.0k | 320 | 34.0k | 2018.12
FF-DF | 1,000 | 509.9k | 1,000 | 509.9k | 2019.01
DFD | 363 | 315.4k | 3,068 | 2,242.7k | 2019.09
DFDC | 1,131 | 488.4k | 4,113 | 1,783.3k | 2019.10
Celeb-DF | 590 | 225.4k | 5,639 | 2,116.8k | 2019.11

Table 1. Basic information of various DeepFake video datasets. *: the original videos in DF-TIMIT are from the Vid-TIMIT dataset.
UADFV: The UADFV dataset [53] contains 49 real YouTube videos and 49 DeepFake videos. The DeepFake videos are generated using the DNN model in FakeAPP [5].

DF-TIMIT: The DeepFake-TIMIT dataset [25] includes 640 DeepFake videos generated with faceswap-GAN [3] and based on the Vid-TIMIT dataset [43]. The videos are divided into two equal-sized subsets, DF-TIMIT-LQ and DF-TIMIT-HQ, with synthesized faces of size 64 × 64 and 128 × 128 pixels, respectively.

FF-DF: The FaceForensics++ dataset [40] includes a subset of DeepFakes videos, which has 1,000 real YouTube videos and the same number of synthetic videos generated using faceswap [4].
DFD: The Google/Jigsaw DeepFake detection dataset [15] has 3,068 DeepFake videos generated based on 363 original videos of 28 consented individuals of various genders, ages, and ethnic groups. The details of the synthesis algorithm are not disclosed, but it is likely an improved implementation of the basic DeepFake maker algorithm.

DFDC: The Facebook DeepFake detection challenge dataset [14] is part of the DeepFake detection challenge, which has 4,113 DeepFake videos created based on 1,131 original videos of 66 consented individuals of various genders, ages, and ethnic groups (the full set of DFDC had not been released at the time of the CVPR submission; the information here is based on the first-round release in [14]). This dataset is created using two different synthesis algorithms, but the details of the synthesis algorithms are not disclosed.

Based on release time and synthesis algorithms, we categorize UADFV, DF-TIMIT, and FF-DF as the first generation of DeepFake datasets, and DFD, DFDC, and the proposed Celeb-DF dataset as the second generation. In general, the second-generation datasets improve over the first generation in both quantity and quality.

3. The Celeb-DF Dataset

Although the current DeepFake datasets contain sufficient numbers of videos, as discussed in the Introduction and demonstrated in Fig. 1, the DeepFake videos in these datasets have various visual artifacts that easily distinguish them from real videos. To provide more relevant data for evaluating and supporting the future development of DeepFake detection methods, we construct the Celeb-DF dataset. A comparison of the Celeb-DF dataset with other existing DeepFake datasets is summarized in Table 1.

3.1. Basic Information

The Celeb-DF dataset is comprised of 590 real videos and 5,639 DeepFake videos (corresponding to over two million video frames). The average length of the videos is approximately 13 seconds, at the standard frame rate of 30 frames per second. The real videos are chosen from publicly available YouTube videos, corresponding to interviews of 59 celebrities with a diverse distribution of genders, ages, and ethnic groups (we choose celebrities' faces as they are more familiar to viewers, so that any visual artifacts can be more readily identified; furthermore, celebrities are anecdotally the main targets of DeepFake videos). 56.8% of the subjects in the real videos are male and 43.2% are female; 8.5% are of age 60 and above, 30.5% are between 50 and 60, 26.6% are in their 40s, 28.0% are in their 30s, and 6.4% are younger than 30; 5.1% are Asian, 6.8% are African American, and 88.1% are Caucasian. In addition, the real videos exhibit a large range of variation in aspects such as the subjects' face sizes (in pixels), orientations, lighting conditions, and backgrounds. The DeepFake videos are generated by swapping faces for each pair of the 59 subjects. The final videos are in MPEG4.0 format.
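For intuition about the scale of these pairwise swaps, a quick back-of-the-envelope check (our illustration, derived only from the counts reported above): 59 subjects yield 59 × 58 = 3,422 ordered (source, target) identity pairs, and the 590 real videos average out to 10 per subject.

```python
# Back-of-the-envelope check of the dataset's combinatorics (illustrative only).
from itertools import permutations

subjects = range(59)
ordered_pairs = list(permutations(subjects, 2))  # directed (source, target) pairs
print(len(ordered_pairs))                        # 3422 possible swaps

real_videos, fake_videos = 590, 5639
print(real_videos / 59)                   # 10.0 real videos per subject on average
print(fake_videos / len(ordered_pairs))   # ~1.65 DeepFake videos per identity pair
```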
Figure 4. Comparison of DeepFake frames with different sizes (64 × 64, 128 × 128, and 256 × 256) of the synthesized faces. Note the improved smoothness of the 256 × 256 synthesized face, which is used in Celeb-DF. This figure is best viewed in color.

Figure 6. Mask generation in existing datasets (top two rows, which exhibit boundary artifacts and visible parts of the original face) and in Celeb-DF (third row, based on facial landmarks and interpolated points). (a) Warped synthesized face overlaid on the target's face. (b) Mask generation. (c) Final synthesis result.

3.2. Synthesis Method

The DeepFake videos in Celeb-DF are generated using an improved DeepFake synthesis algorithm, which is key to the improved visual quality as shown in Fig. 2. Specifically, the basic DeepFake maker algorithm is refined in several aspects, targeting the following specific visual artifacts observed in existing datasets.
Low resolution of synthesized faces: The basic DeepFake maker algorithm generates low-resolution faces (typically 64 × 64 or 128 × 128 pixels). We improve the resolution of the synthesized face to 256 × 256 pixels. This is achieved by using encoder and decoder models with more layers and increased dimensions; we determine the structure empirically, balancing increased training time against better synthesis results. The higher-resolution synthesized faces are of better visual quality and are less affected by the resizing and rotation operations needed to accommodate the input target faces (Fig. 4).
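As a rough illustration of the "more layers, increased dimensions" change, a decoder along the lines of the earlier sketch can be extended with additional upsampling blocks so the output grows from 64 × 64 to 256 × 256; the block count and channel widths below are assumptions for the sketch, not the actual Celeb-DF configuration.

```python
import torch
import torch.nn as nn

def up_block(cin, cout):
    # Each block doubles the spatial resolution.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1), nn.ReLU())

# From an 8x8 latent feature map, five doublings reach 256x256 (8->16->32->64->128->256).
decoder_256 = nn.Sequential(
    up_block(512, 256), up_block(256, 128), up_block(128, 64), up_block(64, 32),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)

latent = torch.randn(1, 512, 8, 8)
print(decoder_256(latent).shape)  # torch.Size([1, 3, 256, 256])
```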
Color mismatch: Color mismatch between the synthesized donor's face and the original target's face is significantly reduced in Celeb-DF by training data augmentation and post-processing. Specifically, in each training epoch, we randomly perturb the colors of the training faces, which forces the DNNs to synthesize images with the same color pattern as the input images. We also apply a color transfer algorithm [38] between the synthesized donor face and the input target face. Fig. 5 shows an example of a synthesized face without (left) and with (right) color correction.

Figure 5. DeepFake frames using a synthesized face without (left) and with (right) color correction. Note the reduced color mismatch between the synthesized face region and the rest of the face. The synthesis method with color correction is used to generate Celeb-DF. This figure is best viewed in color.
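The two color-handling steps can be sketched as follows. This is a hedged illustration: the paper does not give the exact perturbation ranges, and the transfer shown is simple per-channel statistics matching, one common reading of a color transfer step.

```python
import numpy as np

rng = np.random.default_rng(0)

def color_jitter(face):
    """Training-time augmentation: randomly perturb per-channel color.
    `face` is a float RGB array in [0, 1]; gain/bias ranges are assumptions."""
    gain = rng.uniform(0.8, 1.2, size=3)
    bias = rng.uniform(-0.1, 0.1, size=3)
    return np.clip(face * gain + bias, 0.0, 1.0)

def match_color_stats(source, target):
    """Post-processing: shift the synthesized (source) face's per-channel
    mean/std to match the target face, a simple form of color transfer."""
    out = np.empty_like(source)
    for c in range(3):
        s_mu, s_sd = source[..., c].mean(), source[..., c].std() + 1e-8
        t_mu, t_sd = target[..., c].mean(), target[..., c].std()
        out[..., c] = (source[..., c] - s_mu) / s_sd * t_sd + t_mu
    return np.clip(out, 0.0, 1.0)
```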
Inaccurate face masks: In previous datasets, the face masks are either rectangular, which may not completely cover the facial parts in the original video frame, or the convex hull of the landmarks on the eyebrows and lower lip, which leaves the boundaries of the mask visible. We improve the mask generation step for Celeb-DF. We first synthesize a face with more surrounding context, so as to completely cover the original facial parts after warping. We then create a smoothed mask based on the landmarks on the eyebrows and interpolated points on the cheeks and between the lower lip and the chin. The difference between the mask generation used in existing datasets and in Celeb-DF is highlighted in Fig. 6 with an example.
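A possible implementation of such a landmark-based smooth mask is sketched below with OpenCV. Treating the eyebrow landmarks plus the interpolated cheek/chin points as a polygon and feathering its boundary with a Gaussian blur is our assumption of how the smoothness is obtained.

```python
import cv2
import numpy as np

def smooth_face_mask(landmarks, shape, feather=15):
    """Build a soft face mask from landmark points.

    landmarks: (N, 2) array of eyebrow landmarks plus points interpolated
               along the cheeks and chin (hypothetical layout).
    shape:     (H, W) of the target frame.
    feather:   Gaussian kernel size used to soften the mask boundary.
    """
    mask = np.zeros(shape, dtype=np.float32)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 1.0)
    # Feathering hides the hard mask boundary that causes splicing artifacts.
    k = feather | 1  # kernel size must be odd
    return cv2.GaussianBlur(mask, (k, k), 0)

def composite(synthesized, target, mask):
    # Alpha-blend the warped synthesized face into the target frame.
    m = mask[..., None]
    return (m * synthesized + (1.0 - m) * target).astype(target.dtype)
```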
Temporal flickering: We reduce temporal flickering of the synthesized faces in the DeepFake videos by incorporating temporal correlations among the detected face landmarks. Specifically, the temporal sequences of the face landmarks are filtered using a Kalman smoothing algorithm to reduce imprecise variations of the landmarks in each frame.
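The following is a minimal, self-contained sketch of this idea: a constant-position Kalman filter run independently over each landmark coordinate (a simple causal variant of the smoothing described above; the noise parameters are assumptions, since the paper does not specify them).

```python
import numpy as np

def kalman_smooth_1d(z, process_var=1e-3, meas_var=1e-1):
    """Smooth one landmark coordinate over time with a scalar Kalman filter."""
    x, p = float(z[0]), 1.0   # state estimate and its variance
    out = np.empty(len(z), dtype=float)
    out[0] = x
    for t in range(1, len(z)):
        p += process_var              # predict (constant-position model)
        k = p / (p + meas_var)        # Kalman gain
        x += k * (z[t] - x)           # update with the noisy detection
        p *= (1.0 - k)
        out[t] = x
    return out

def smooth_landmarks(tracks):
    """tracks: (T, L, 2) array of L landmark positions over T frames."""
    smoothed = np.empty(tracks.shape, dtype=float)
    T, L, _ = tracks.shape
    for i in range(L):
        for c in range(2):
            smoothed[:, i, c] = kalman_smooth_1d(tracks[:, i, c])
    return smoothed
```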
3.3. Visual Quality

The refinements to the synthesis algorithm improve the visual quality of the DeepFake videos in the Celeb-DF dataset, as demonstrated in Fig. 2. We would like to have a more quantitative evaluation of the improvement in visual quality of the DeepFake videos in Celeb-DF, and to compare with the previous DeepFake datasets. Ideally, a reference-free face image quality metric would be the best choice for this purpose. Unfortunately, to date there is no such metric that is agreed upon and widely adopted.

Instead, we follow the face in-painting work [45] and use the Mask-SSIM score [32] as a referenced quantitative metric of the visual quality of synthesized DeepFake video frames. Mask-SSIM corresponds to the SSIM score [52] between the head regions (including face and hair) of the DeepFake video frame and the corresponding original video frame, i.e., the head region of the original target is the reference for visual quality evaluation. As such, a low Mask-SSIM score may be due to inferior visual quality as well as to the change of identity from the target to the donor. On the other hand, since we only compare frames from DeepFake videos, the errors caused by identity changes are biased in a similar fashion across all compared datasets. Therefore, the numerical values of Mask-SSIM may not be meaningful for evaluating the absolute visual quality of the synthesized faces, but differences in Mask-SSIM reflect differences in visual quality.

The Mask-SSIM score takes values in the range [0, 1], with higher values corresponding to better image quality. Table 2 shows the average Mask-SSIM scores for all compared datasets, with Celeb-DF having the highest score. This confirms the visual observation that Celeb-DF has improved visual quality, as shown in Fig. 2.
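A minimal sketch of how Mask-SSIM can be computed with scikit-image is shown below. Restricting the averaged SSIM map to a binary head mask is our reading of the metric, and the mask itself would come from a separate head/hair segmentation step not shown here.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mask_ssim(fake_frame, real_frame, head_mask):
    """SSIM averaged only over the head region (face and hair).

    fake_frame, real_frame: float RGB arrays in [0, 1], same shape.
    head_mask: boolean (H, W) array marking the head region.
    """
    _, ssim_map = structural_similarity(
        real_frame, fake_frame, channel_axis=-1, data_range=1.0, full=True)
    # Average the per-pixel SSIM values inside the head mask only.
    return float(ssim_map[head_mask].mean())
```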
Dataset | UADFV | DF-TIMIT-LQ | DF-TIMIT-HQ | FF-DF | DFD | DFDC | Celeb-DF
Mask-SSIM | 0.82 | 0.80 | 0.80 | 0.81 | 0.88 | 0.84 | 0.92

Table 2. Average Mask-SSIM scores of different DeepFake datasets. Computing Mask-SSIM requires exactly corresponding pairs of DeepFake synthesized frames and original video frames, which is not the case for DFD and DFDC. For these two datasets, we calculate Mask-SSIM on the videos for which we have exact correspondences, i.e., 311 videos in DFD and 2,025 videos in DFDC.
4. Evaluating DeepFake Detection Methods

Using Celeb-DF and other existing DeepFake datasets, we perform the most comprehensive performance evaluation of DeepFake detection to date, with the largest number of DeepFake detection methods and datasets considered. This evaluation has two purposes. First, using the average detection performance as an indicator of the challenge level of each DeepFake dataset, we further compare Celeb-DF with existing DeepFake datasets. Second, we survey the performance of current DeepFake detection methods on a large diversity of DeepFake videos, in particular the high-quality ones in Celeb-DF.

4.1. Compared DeepFake Detection Methods

We consider nine DeepFake detection methods in our experiments. Because of the need to run each method on the Celeb-DF dataset, we choose only those that have code and the corresponding DNN models publicly available or obtained from the authors directly.
• Two-stream [54] uses a two-stream CNN to achieve state-of-the-art performance in general-purpose image forgery detection. The underlying CNN is the GoogLeNet InceptionV3 model [48] trained on the SwapMe dataset [54]. We use it as a baseline against which to compare the dedicated DeepFake detection methods.
• MesoNet [6] is a CNN-based DeepFake detection method targeting the mesoscopic properties of images. The model is trained on an unpublished DeepFake dataset collected by the authors. We evaluate two variants of MesoNet, namely Meso4 and MesoInception4: Meso4 uses conventional convolutional layers, while MesoInception4 is based on the more sophisticated Inception modules [49].
• HeadPose [53] detects DeepFake videos using the inconsistencies in the head poses of the synthesized videos, based on an SVM model applied to 3D head orientations estimated from each video. The SVM model in this method is trained on the UADFV dataset.
• FWA [28] detects DeepFake videos using a ResNet-50 [19] to expose the face warping artifacts introduced by the resizing and interpolation operations in the basic DeepFake maker algorithm. This model is trained on self-collected face images.
• VA [33] is a recent DeepFake detection method based on capturing visual artifacts in the eyes, teeth, and facial contours of the synthesized faces. There are two variants of this method: VA-MLP is based on a multilayer feedforward neural network classifier, and VA-LogReg uses a simpler logistic regression model. These models are trained on an unpublished dataset, in which the real images are cropped from the CelebA dataset [31] and the DeepFake videos are from YouTube.
• Xception [40] corresponds to a DeepFake detection method based on the XceptionNet model [12] trained on the FaceForensics++ dataset. There are three variants of Xception, namely Xception-raw, Xception-c23, and Xception-c40: Xception-raw is trained on raw videos, while Xception-c23 and Xception-c40 are trained on H.264 videos with medium (23) and high (40) degrees of compression, respectively.
• Multi-task [34] is another recent DeepFake detection method that uses a CNN model to simultaneously detect manipulated images and segment the manipulated areas as a multi-task learning problem. This model is trained on the FaceForensics dataset [39].
• Capsule [36] uses capsule structures [42] based on a VGG19 [44] network as the backbone architecture for DeepFake classification. This model is trained on the FaceForensics++ dataset.
• DSP-FWA is a recently further improved method based on FWA, which includes a spatial pyramid pooling (SPP) module [18] to better handle the variations in the resolutions of the original target faces. This method is trained on self-collected face images.

A concise summary of the underlying models, source code, and training datasets of the DeepFake detection methods considered in our experiments is given in Table 3.

4.2. Experimental Settings

We evaluate the overall detection performance using the area under the ROC curve (AUC) score at the frame level for all key frames. There are several reasons for this choice. First, all compared methods analyze individual frames (usu-
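Frame-level AUC over key frames can be computed along these lines; the sketch below (ours, using scikit-learn) assumes per-frame fake-probability scores have already been produced by a detector.

```python
from sklearn.metrics import roc_auc_score

def frame_level_auc(videos):
    """videos: list of (label, frame_scores) pairs, where label is 0 for a
    real video and 1 for a DeepFake, and frame_scores holds one detector
    score per extracted key frame."""
    labels, scores = [], []
    for label, frame_scores in videos:
        # Every key frame inherits its video's label.
        labels.extend([label] * len(frame_scores))
        scores.extend(frame_scores)
    return roc_auc_score(labels, scores)

# Example: one real and one fake video with three key frames each.
print(frame_level_auc([(0, [0.1, 0.2, 0.15]), (1, [0.8, 0.6, 0.9])]))  # 1.0
```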
Methods | Model Type | Training Dataset | Repositories | Release Date
Two-stream [54] | GoogLeNet InceptionV3 [48] | SwapMe [54] | Unpublished code provided by the authors | 2018.03
MesoNet [6] | Designed CNN | Unpublished | https://github.com/DariusAf/MesoNet | 2018.09
HeadPose [53] | SVM | UADFV [53] | https://bitbucket.org/ericyang3721/headpose_forensic/ | 2018.11
FWA [28] | ResNet-50 [19] | Unpublished | https://github.com/danmohaha/CVPRW2019_Face_Artifacts | 2018.11
VA-MLP [33] | Designed CNN | Unpublished | https://github.com/FalkoMatern/Exploiting-Visual-Artifacts | 2019.01
VA-LogReg [33] | Logistic Regression Model | Unpublished | https://github.com/FalkoMatern/Exploiting-Visual-Artifacts | 2019.01
Xception [40] | XceptionNet [12] | FaceForensics++ [40] | https://github.com/ondyari/FaceForensics | 2019.01
Multi-task [34] | Designed CNN | FaceForensics [39] | https://github.com/nii-yamagishilab/ClassNSeg | 2019.06
Capsule [36] | Designed CapsuleNet [42] | FaceForensics++ | https://github.com/nii-yamagishilab/Capsule-Forensics-v2 | 2019.10
DSP-FWA | SPPNet [18] | Unpublished | https://github.com/danmohaha/DSP-FWA | 2019.11

Table 3. Summary of compared DeepFake detection methods. See text for more details.