FREQUENCY MASKING FOR UNIVERSAL DEEPFAKE DETECTION

Chandler Timm Doloriel, Ngai-Man Cheung∗

Singapore University of Technology and Design (SUTD)

ABSTRACT

We study universal deepfake detection. Our goal is to detect synthetic images from a range of generative AI approaches, particularly from emerging ones which are unseen during training of the deepfake detector. Universal deepfake detection requires outstanding generalization capability. Motivated by recently proposed masked image modeling, which has demonstrated excellent generalization in self-supervised pre-training, we make the first attempt to explore masked image modeling for universal deepfake detection. We study spatial and frequency domain masking in training deepfake detectors. Based on empirical analysis, we propose a novel deepfake detector via frequency masking. Our focus on the frequency domain differs from the majority of existing work, which primarily targets spatial domain detection. Our comparative analyses reveal substantial performance gains over existing methods. Code and models are publicly available¹.

Index Terms: deepfake, masked image modeling, generative AI, GAN, diffusion models

∗ Corresponding Author
¹ https://github.com/chandlerbing65nm/FakeImageDetection.git

1. INTRODUCTION

The proliferation of increasingly convincing synthetic images, facilitated by generative AI, poses significant challenges across multiple sectors, including cybersecurity, digital forensics, and public discourse [1, 2, 3, 4]. These AI-generated images can be misused as deepfakes for malicious purposes, e.g., disinformation. Detection of deepfakes is therefore an important problem that has attracted much attention.

Universal deepfake detection. Early efforts in deepfake detection focus on identifying synthetic images generated by particular types of generative AI [5]. However, with rapid advancements (e.g., diffusion models [6]), there is increasing interest in universal deepfake detection, which must perform effectively for a range of generative AI approaches, particularly emerging ones unseen during training of the deepfake detector. Universal deepfake detection therefore necessitates significant generalization capability. Wang et al. [7] investigated post-processing and data augmentation techniques for detecting synthetic images, primarily those generated by various Generative Adversarial Network (GAN) models. Subsequently, Chandrasegaran et al. [8] focused on forensic features for universal deepfake detection. More recent work, as demonstrated by Ojha et al. [1], utilizes the feature space of a large pretrained model for this purpose; however, this method hinges on the availability of a model pretrained on a very large dataset. Finally, Chen et al. [9] investigated one-shot test-time training to enhance generalization in detection, albeit at the cost of additional computation for each test sample.

Masked image modeling. Meanwhile, in self-supervised pre-training, masked image modeling has emerged in the past year as a promising approach to improve generalization capability [10, 11, 12, 13]. Specifically, in pre-training, the pretext task learns to predict masked portions of the unlabeled data, and a reconstruction loss is used as the learning objective. After pre-training, the pre-trained encoder can be effectively adapted to various downstream tasks. The study in [10] demonstrates that pre-training through a masked autoencoder (MAE) empowers high-capacity models to attain state-of-the-art (SOTA) generalization performance across a multitude of downstream tasks. Furthermore, extensive experiments in [11] reveal that masking techniques can lead to SOTA results in out-of-distribution detection.

In our work, we make the first attempt to explore masked image modeling to improve the generalization capability of deepfake detectors, with the objective of advancing universal deepfake detection. Unlike traditional masked image modeling, which primarily uses a reconstruction loss in self-supervised pre-training, our method applies masking in a supervised setting, focusing on a classification loss for distinguishing real and fake images. Our training involves both spatial and frequency domain masking on all images, as depicted in Figure 1. This technique, which obscures parts of the image, makes training more challenging. It aims to prevent the detector from depending on superficial features and instead fosters the development of robust, generalizable representations for effective universal deepfake detection. Notably, masking is employed only during training, not in the testing phase.

In our study, we analyze both spatial and frequency masking for universal deepfake detection. Our results suggest that frequency masking is more effective than spatial masking in generalizing deepfake detection to different generative AI approaches. Our finding is consistent with a recent study by Corvi et al. [14], which discovers frequency artifacts in GAN and diffusion model-based synthetic images.
Different from [14], our main contribution is a new training method to improve detection accuracy via frequency masking. We remark that most existing detectors focus on the spatial domain [1, 2, 7]. Our contributions are summarized as follows:

1. We present the first study to explore masked image modeling for universal deepfake detection.
2. We analyze two distinct types of masking methods (spatial and frequency), and empirically demonstrate that frequency masking performs better (Fig. 1).
3. We conduct analysis and experiments to validate the effectiveness of universal deepfake detection via frequency masking.

Fig. 1: Our proposed training of a universal deepfake detector using spatial and frequency domain masking. In both cases, L represents the binary cross-entropy loss for real/fake discrimination. (a) Our spatial domain masking uses either individual pixels or patches to mask portions of the input image. (b) Our frequency domain masking transforms the input image I(x, y) to the frequency domain F(u, v) using the FFT. Guided by a frequency band selector and a mask ratio, specific frequencies within F(u, v) are nullified to yield M(u, v). The inverse FFT produces the masked image Im(x, y), which serves as the classifier input for training universal deepfake detection. We remark that masking is applied only in the supervised training stage to encourage the detector to learn generalizable representations; no masking is applied in the testing stage.

2. METHODOLOGY

To improve the performance of universal deepfake detection systems, we introduce masking strategies operating in two domains: spatial and frequency. This section delves into the details of these techniques, showing how they contribute to the improvement of universal deepfake detection.

2.1. Spatial Domain Masking

For spatial domain masking, we study two distinct methods: Patch Masking and Pixel Masking. Patch Masking divides an image of size H × W into non-overlapping patches of size p × p, so the number of patches is N = (H × W)/p². A ratio r determines the number of patches to be masked, m = ⌈r × N⌉. The masked patches are selected randomly, and their pixel values are set to zero.

In Pixel Masking, each pixel is considered independently. Given an image of dimensions H × W, the total number of pixels is T = H × W. A ratio r specifies the portion of pixels to mask, resulting in m = ⌈r × T⌉ masked pixels, selected randomly across the image.

The masking operation for both patch and pixel masking can be defined as:

M(i, j) = 0 if (i, j) ∈ m, otherwise M(i, j) = I(i, j),   (1)

where m denotes the set of masked locations, M is the masked image, and I is the original image. Both methods create a binary mask that is element-wise multiplied with the original image to produce the masked image. These masking strategies serve as the foundation for our frequency-based masking approach, enabling effective universal deepfake detection by focusing on important features.
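To make the two spatial masking variants concrete, here is a minimal NumPy sketch of Patch Masking and Pixel Masking as described above (Eq. 1). The function names, the default patch size p = 8, and the handling of image borders are our assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def patch_mask(image: np.ndarray, ratio: float, p: int = 8) -> np.ndarray:
    """Patch Masking: zero out m = ceil(r * N) random non-overlapping p x p patches."""
    h, w = image.shape[:2]
    masked = image.copy()
    rows, cols = h // p, w // p              # patches per dimension (borders ignored in this sketch)
    n = rows * cols                          # N = H * W / p^2
    m = int(np.ceil(ratio * n))              # m = ceil(r * N)
    for idx in np.random.choice(n, size=m, replace=False):
        i, j = divmod(idx, cols)
        masked[i * p:(i + 1) * p, j * p:(j + 1) * p] = 0
    return masked

def pixel_mask(image: np.ndarray, ratio: float) -> np.ndarray:
    """Pixel Masking: zero out m = ceil(r * T) randomly chosen pixels, where T = H * W."""
    h, w = image.shape[:2]
    t = h * w
    m = int(np.ceil(ratio * t))
    mask = np.ones(t, dtype=image.dtype)
    mask[np.random.choice(t, size=m, replace=False)] = 0
    mask = mask.reshape(h, w)
    return image * (mask[..., None] if image.ndim == 3 else mask)
```

For example, patch_mask(img, ratio=0.15) masks 15% of the patches, matching the default masking ratio adopted later in the paper.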
2.2. Frequency Domain Masking

Our frequency domain masking utilizes the Fast Fourier Transform (FFT) to represent the image in terms of its frequency components. Given an image of dimensions H × W, where H and W are the height and width respectively, we first compute its frequency representation F(u, v) using the FFT:

F(u, v) = F{I(x, y)},   (2)

where u and v index the frequencies along the two spatial dimensions of the image, F denotes the FFT operation, and I(x, y) is the original image in spatial coordinates.

The frequency masking is dictated by a masking ratio r and a specified frequency band ('low', 'mid', 'high', 'all'). The regions for each frequency band are defined as follows:

• Low Band: 0 ≤ u < H/4, 0 ≤ v < W/4
• Mid Band: H/4 ≤ u < 3H/4, W/4 ≤ v < 3W/4
• High Band: 3H/4 ≤ u < H, 3W/4 ≤ v < W
• All Band: 0 ≤ u < H, 0 ≤ v < W

The division into Low, Mid, and High bands serves to isolate the contributions of specific frequency components to the overall image features. These divisions are calculated based on the dimensions H × W of the Fourier transform of the image. The Low Band captures the coarse or global features of an image; these are the low-frequency components that represent the most significant portions of the image. The Mid Band targets medium-frequency components, which often account for textures and other finer details. The High Band focuses on high-frequency noise and edge details, which are less dominant but possibly crucial for tasks like deepfake detection. The All Band simply includes all frequencies, providing the most comprehensive masking strategy.

Given the selected frequency band [u_start, u_end] × [v_start, v_end], the number of frequencies to be nullified is N = ⌈r × (u_end − u_start) × (v_end − v_start)⌉. These N frequencies are selected randomly within the band and set to zero, resulting in the masked frequency representation M(u, v):

M(u, v) = 0 if (u, v) ∈ N, otherwise M(u, v) = F(u, v),   (3)

where N denotes the set of nullified frequencies. The masked image Im(x, y) is obtained by applying the inverse FFT to M(u, v): Im(x, y) = F⁻¹{M(u, v)}.
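As a concrete illustration of Eqs. (2)-(3), the following NumPy sketch applies the band-restricted random frequency masking and returns the reconstructed image. The band-to-index mapping, the use of the unshifted FFT layout, and the function name are our assumptions; the authors' exact conventions (e.g., whether the spectrum is centered with fftshift) may differ.

```python
import numpy as np

# Band limits as fractions of the FFT dimensions, following the ranges listed above.
BANDS = {"low": (0.00, 0.25), "mid": (0.25, 0.75), "high": (0.75, 1.00), "all": (0.00, 1.00)}

def frequency_mask(image: np.ndarray, ratio: float = 0.15, band: str = "all") -> np.ndarray:
    """Nullify N = ceil(r * band area) random frequencies in the chosen band, then invert the FFT."""
    h, w = image.shape[:2]
    f = np.fft.fft2(image, axes=(0, 1))                # F(u, v) = FFT{I(x, y)}, Eq. (2)
    lo, hi = BANDS[band]
    u0, u1 = int(lo * h), int(hi * h)                  # e.g., for a 256 x 256 image, low band: u, v in [0, 64)
    v0, v1 = int(lo * w), int(hi * w)
    bh, bw = u1 - u0, v1 - v0
    n = int(np.ceil(ratio * bh * bw))                  # N = ceil(r * (u_end - u_start) * (v_end - v_start))
    picked = np.random.choice(bh * bw, size=n, replace=False)
    uu, vv = np.unravel_index(picked, (bh, bw))
    f[u0 + uu, v0 + vv] = 0                            # masked spectrum M(u, v), Eq. (3)
    return np.fft.ifft2(f, axes=(0, 1)).real           # I_m(x, y) = inverse FFT of M(u, v)
```

With the defaults, frequency_mask(img) corresponds to the paper's default setting of 15% random masking across all frequency bands.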
Fig. 2: Performance of different masking types in terms of mean Average Precision (mAP) at a 15% masking ratio. The graph shows a marked improvement when transitioning from Pixel to Patch, and eventually to Frequency-based masking. Specifically, frequency-based masking attains the highest mAP of 88.22%, underscoring its effectiveness over the other masking types. (Bar chart; x-axis: mask types Pixel, Patch, Frequency; y-axis: mean Average Precision, %.)

Table 2: Scores of mean Average Precision (mAP) across various generative models for random 15% masking of different frequency bands.

Generative Models      Low     Mid     High    All
GANs                   95.73   93.92   94.94   96.16
DeepFake               85.64   87.22   80.17   79.07
Low-Level Vision       85.77   83.69   88.78   87.27
Perceptual Loss        99.21   97.98   98.11   98.41
Guided Diffusion       74.90   70.19   69.26   72.42
Latent Diffusion       76.35   75.60   66.39   80.45
Glide                  85.69   80.65   73.66   84.98
DALL-E                 70.60   68.72   71.18   80.11
average mAP            87.45   85.35   83.38   88.22

3. EXPERIMENTS

Dataset: In our experiments, we strictly followed the training and validation setup of Wang et al. [7], using ProGAN with 720k and 4k samples, respectively. For testing, we used data from GANs (ProGAN, CycleGAN, BigGAN, StyleGAN, GauGAN, and StarGAN), DeepFake, low-level vision models (SITD and SAN), and perceptual loss models (CRN and IMLE) from Wang et al. [7]. Additionally, we incorporated 1k samples per class from diffusion models [1]: Guided Diffusion; Latent Diffusion (LDM) with varying numbers of noise-refinement steps (e.g., 100, 200) and generation guidance (w/ CFG); Glide, which uses two stages of noise refinement, where the first stage (e.g., 50 or 100 steps) produces a low-resolution 64 × 64 image that is then upsampled to 256 × 256 with 27 steps in the second stage; and lastly DALL-E-mini.
State-of-the-art (SOTA): We compared with the following SOTA methods: Wang et al. [7] and Gragnaniello et al. [2]. These methods achieve SOTA detection accuracy across GANs and diffusion models according to a recent study [15]. We applied our proposed frequency masking to these SOTA methods and evaluated the improvement.

As our default setting, we used Wang et al. [7] and incorporated image augmentations such as Gaussian blur and JPEG compression, each with a 10% likelihood of being applied.
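For context, the sketch below shows one way the train-only masking could be wired into a supervised real/fake training step with the 10% blur and JPEG augmentations. The backbone choice (torchvision ResNet-50 with a single logit, in the spirit of Wang et al. [7]), the helper names gaussian_blur and jpeg_compress, and the data format are assumptions for illustration; masking and augmentations are applied here only, never at test time.

```python
import random
import torch
import torch.nn as nn
from torchvision import models

# Assumed backbone: ResNet-50 with a single real/fake logit.
detector = models.resnet50(weights="IMAGENET1K_V1")
detector.fc = nn.Linear(detector.fc.in_features, 1)
criterion = nn.BCEWithLogitsLoss()            # binary cross-entropy for real/fake discrimination

def train_step(images, labels, optimizer, mask_ratio=0.15, band="all"):
    """One training step; `images` is a list of HxWx3 float arrays in [0, 1]."""
    batch = []
    for img in images:
        img = frequency_mask(img, ratio=mask_ratio, band=band)   # frequency masking (train time only)
        if random.random() < 0.10:
            img = gaussian_blur(img)          # hypothetical augmentation helper, 10% probability
        if random.random() < 0.10:
            img = jpeg_compress(img)          # hypothetical augmentation helper, 10% probability
        batch.append(torch.from_numpy(img).permute(2, 0, 1).float())
    logits = detector(torch.stack(batch)).squeeze(1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time, images are passed to the detector without any masking or augmentation, as stated above.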
3.1. Comparison of Mask Types

As depicted in Figure 2, we observe a distinct hierarchy in mean Average Precision (mAP) across masking types. Pixel masking exhibits the lowest performance, with an mAP of 75.12%. In contrast, patch masking, using random 8 × 8 blocks of pixels as patches, shows a notable improvement with an mAP of 86.09%. The most interesting result, however, is the superior performance of frequency-based masking, which reaches an mAP of 88.22%. This clearly indicates that frequency-based masking captures more generalizable features than pixel and patch masking, thereby substantiating its effectiveness in the universal deepfake detection task.

Table 1: Mean Average Precision (mAP) scores for varying degrees of frequency masking. The highest mAP is achieved at a 15% masking ratio.

Masking Ratio (%)    Mean Average Precision (%)
0                    85.86
15                   88.22
30                   87.20
50                   85.12
70                   83.86
Table 3: Generalization results showcasing the effectiveness of our frequency-based masking technique. We compare the Average Precision (AP) with Wang et al. [7], Gragnaniello et al. [2], and Ojha et al. [1], which are SOTA according to a recent study [15]. When integrated with these SOTA methods, our approach consistently improves mAP by significant amounts, highlighting the utility of frequency masking in enhancing universal deepfake detection. AP values are listed per test set in the following order: ProGAN, CycleGAN, BigGAN, StyleGAN, GauGAN, StarGAN, DeepFake, SITD, SAN, CRN, IMLE, Guided Diffusion, LDM (200 steps), LDM (200 steps w/ CFG), LDM (100 steps), Glide (100, 27), Glide (50, 27), Glide (100, 10), DALL-E, followed by the average mAP.

Wang et al. [7], Blur+JPEG (0.5): 100.00, 96.83, 88.24, 98.51, 98.09, 95.45, 66.27, 92.76, 63.87, 98.94, 99.52, 68.35, 65.92, 66.74, 65.99, 72.03, 76.52, 73.22, 66.26; mAP 81.76
Wang et al. [7] + Ours, Blur+JPEG (0.5): 100.00, 92.97, 90.06, 98.15, 97.80, 87.94, 73.09, 89.92, 74.57, 96.97, 98.00, 70.98, 77.75, 72.44, 78.21, 78.53, 82.61, 79.27, 78.16; mAP 85.13 (+3.37)
Wang et al. [7], Blur+JPEG (0.1): 100.00, 93.47, 84.51, 99.59, 89.49, 98.15, 89.02, 97.23, 70.45, 98.22, 98.39, 77.67, 71.16, 73.01, 72.53, 80.51, 84.62, 82.06, 71.30; mAP 85.86
Wang et al. [7] + Ours, Blur+JPEG (0.1): 100.00, 93.75, 92.19, 98.93, 97.18, 94.93, 79.07, 95.72, 78.81, 98.63, 98.19, 72.42, 81.89, 78.01, 81.45, 83.13, 87.50, 84.30, 80.11; mAP 88.22 (+2.36)
Gragnaniello et al. [2], Blur+JPEG (0.1): 100.00, 83.24, 94.29, 99.95, 91.69, 99.99, 91.23, 92.75, 73.67, 98.19, 97.85, 78.31, 88.37, 88.15, 89.26, 89.41, 93.39, 90.48, 92.46; mAP 91.19
Gragnaniello et al. [2] + Ours, Blur+JPEG (0.1): 100.00, 94.07, 97.61, 99.92, 98.52, 99.99, 94.01, 95.44, 81.64, 96.73, 95.47, 80.10, 94.60, 94.46, 95.02, 90.68, 93.72, 91.42, 96.47; mAP 94.20 (+3.01)
Ojha et al. [1], Blur+JPEG (0.1): 100.00, 99.16, 96.08, 90.72, 99.81, 98.67, 77.30, 67.21, 74.81, 72.97, 94.02, 81.42, 95.45, 80.68, 96.44, 90.53, 91.54, 90.09, 86.59; mAP 88.60
Ojha et al. [1] + Ours, Blur+JPEG (0.1): 100.00, 99.04, 96.89, 92.66, 99.83, 98.86, 77.17, 67.41, 75.80, 78.83, 95.85, 81.77, 95.45, 80.93, 96.48, 90.66, 91.71, 90.23, 86.73; mAP 89.28 (+0.68)
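The tables report per-generator Average Precision (AP) and their mean (mAP). A minimal sketch of that computation is given below, assuming per-image binary labels and detector scores are collected for each test set; the use of scikit-learn's average_precision_score is our assumption about tooling, not necessarily the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(test_sets):
    """test_sets: dict mapping a generator name (e.g. "ProGAN") to (labels, scores) arrays.
    Returns per-generator AP and their mean (mAP), the quantities reported in Tables 1-3."""
    ap = {name: average_precision_score(y, s) for name, (y, s) in test_sets.items()}
    return ap, float(np.mean(list(ap.values())))
```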

3.2. Ratio of Frequency Masking

As shown in Table 1, we observe a clear trend in the performance of our frequency masking technique across different masking ratios. The highest mean Average Precision (mAP) of 88.22% is achieved at a masking ratio of 15%. As the masking ratio increases, there is a noticeable decline in performance, with the mAP dropping to 83.86% at a 70% masking ratio. This trend suggests that the model is sensitive to the proportion of frequencies being masked, and that excessive masking can compromise the model's ability to detect subtle features in the images. Based on these results, we select 15% as the default masking ratio for frequency masking in our experiments.

3.3. Frequency Bands for Masking

Table 2 presents the mAP scores for various generative models when different frequency bands (Low, Mid, High, and All) are randomly masked. The highest mAP is observed when all frequency bands are masked (88.22%), suggesting that a comprehensive masking strategy is most effective for universal deepfake detection. However, there are nuances: for instance, DeepFake performs best with Mid-frequency masking (87.22%), while Low-Level Vision models excel with High-frequency masking (88.78%). This variance indicates that different generative models may leave distinct forensic artifacts in different frequency bands, making a targeted masking strategy potentially beneficial.

3.4. Comparison with State-of-the-Art

As shown in Table 3, our approach ('+ Ours'), incorporating 15% random masking across all frequency bands, consistently enhances performance when integrated with existing state-of-the-art (SOTA) methods. Specifically, combining our frequency-based masking technique with Wang et al.'s method [7] increases mean Average Precision (mAP) by 3.37% and 2.36% for the Blur+JPEG (0.5) and Blur+JPEG (0.1) variants, respectively. Furthermore, integrating our method with that of Gragnaniello et al. [2] yields a notable improvement of 3.01%. These results attest to the effectiveness of frequency masking, as it enables the model to learn generalizable features for universal deepfake detection that are potentially overlooked by the approaches of Wang et al. [7] and Gragnaniello et al. [2].

Moreover, our approach also helps when coupled with Ojha et al.'s method [1], which is based on the visual encoder of CLIP [16], yielding an mAP boost of 0.68%. This further affirms the adaptability of our approach across different backbones for universal deepfake detection. By strategically masking random frequencies, our model learns more generalizable features. These results establish our approach as a valuable asset in the pursuit of more accurate universal deepfake detection.

4. CONCLUSION

Motivated by recent promising results of masked image modeling, we proposed a frequency-based masking strategy tailored for universal deepfake detection. Our method enables the deepfake detector to learn generalizable features in the frequency domain. Our experiments, conducted on samples from a range of generative models, empirically demonstrate the effectiveness of the proposed approach. The superior performance of our method highlights the potential of masked image modeling and frequency-domain approaches for the challenging problem of universal deepfake detection.

5. ACKNOWLEDGEMENT

This research is supported by the National Research Foundation, Singapore under its AI Singapore Programmes (AISG Award No.: AISG2-TC-2022-007) and SUTD project PIE-SGP-AI-2018-01. This research work is also supported by the Agency for Science, Technology and Research (A*STAR) under its MTC Programmatic Funds (Grant No. M23L7b0021). This material is based on research/work supported in part by the Changi General Hospital and Singapore University of Technology and Design, under the HealthTech Innovation Fund (HTIF Award No. CGH-SUTD-2021-004). We thank the anonymous reviewers for their insightful feedback.
References

[1] U. Ojha, Y. Li, and Y. Lee, "Towards universal fake image detectors that generalize across generative models," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 24480-24489.

[2] D. Gragnaniello, D. Cozzolino, F. Marra, G. Poggi, and L. Verdoliva, "Are GAN generated images easy to detect? A critical analysis of the state-of-the-art," in 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, July 5-9, 2021, pp. 1-6, IEEE.

[3] L. Chai, D. Bau, S. Lim, and P. Isola, "What makes fake images detectable? Understanding properties that generalize," in Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXVI, vol. 12371 of Lecture Notes in Computer Science, pp. 103-120, Springer, 2020.

[4] M. Abdollahzadeh, T. Malekzadeh, C. T. H. Teo, K. Chandrasegaran, G. Liu, and N.-M. Cheung, "A survey on generative modeling with limited data, few shots, and zero shot," arXiv preprint arXiv:2307.14397, 2023.

[5] Y. Mirsky and W. Lee, "The creation and detection of deepfakes," ACM Computing Surveys (CSUR), vol. 54, pp. 1-41, 2020.

[6] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674-10685, 2022.

[7] S. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, "CNN-generated images are surprisingly easy to spot... for now," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 13-19, 2020, pp. 8692-8701, Computer Vision Foundation / IEEE.

[8] K. Chandrasegaran, N. Tran, A. Binder, and N. Cheung, "Discovering transferable forensic features for CNN-generated images detection," in Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XV, vol. 13675 of Lecture Notes in Computer Science, pp. 671-689, Springer, 2022.

[9] L. Chen, Y. Zhang, Y. Song, J. Wang, and L. Liu, "OST: Improving generalization of deepfake detection via one-shot test-time training," in Neural Information Processing Systems, 2022.

[10] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick, "Masked autoencoders are scalable vision learners," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, June 18-24, 2022, pp. 15979-15988, IEEE.

[11] J. Li, P. Chen, S. Yu, Z. He, S. Liu, and J. Jia, "Rethinking out-of-distribution (OOD) detection: Masked image modeling is all you need," in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11578-11589, 2023.

[12] J. Xie, W. Li, X. Zhan, Z. Liu, Y. Ong, and C. Loy, "Masked frequency modeling for self-supervised visual pre-training," in The Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1-5, 2023, OpenReview.net.

[13] J. Huang, K. Cui, D. Guan, A. Xiao, F. Zhan, S. Lu, S. Liao, and E. Xing, "Masked generative adversarial networks are data-efficient generation learners," in Neural Information Processing Systems, 2022.

[14] R. Corvi, D. Cozzolino, G. Poggi, K. Nagano, and L. Verdoliva, "Intriguing properties of synthetic images: from generative adversarial networks to diffusion models," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, June 17-24, 2023, pp. 973-982, IEEE.

[15] R. Corvi, D. Cozzolino, G. Zingarini, G. Poggi, K. Nagano, and L. Verdoliva, "On the detection of synthetic images generated by diffusion models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

[16] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning, 2021.
