AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark

Li Lin1, Santosh1, Xin Wang2, Shu Hu1
1Purdue University     2University at Albany, SUNY
Corresponding author: Shu Hu (hu968@purdue.edu)
Abstract

AI-generated faces have enriched human life in areas such as entertainment, education, and art. However, they also pose misuse risks. Detecting AI-generated faces is therefore crucial, yet current detectors show biased performance across different demographic groups. Biases can be mitigated by designing algorithmic fairness methods, which usually require demographically annotated face datasets for model training. However, no existing dataset comprehensively encompasses both demographic attributes and diverse generative methods, which hinders the development of fair detectors for AI-generated faces. In this work, we introduce the AI-Face dataset, the first million-scale demographically annotated AI-generated face image dataset, including real faces, faces from deepfake videos, and faces generated by Generative Adversarial Networks and Diffusion Models. Based on this dataset, we conduct the first comprehensive fairness benchmark to assess various AI face detectors and provide valuable insights and findings to promote the future fair design of AI face detectors. Our AI-Face dataset and benchmark code are publicly available at https://github.com/Purdue-M2/AI-Face-FairnessBench.

Figure 1: Overview of AI-Face dataset. Each face has three demographic annotations with uncertainty scores.

1 Introduction

AI-generated faces are created using sophisticated AI technologies and are visually difficult to discern from real ones [1]. They can be summarized into three categories: faces from deepfake videos [2], typically created using Variational Autoencoders (VAEs) [3, 4]; faces generated by Generative Adversarial Networks (GANs) [5, 6, 7, 8]; and faces generated by Diffusion Models (DMs) [9]. These technologies have significantly advanced the realism and controllability of synthetic facial representations. Generated faces can enrich media and increase creativity [10]. However, they also carry significant risks of misuse. For example, during the 2024 United States presidential election, fake face images of Donald Trump surrounded by groups of Black people smiling and laughing spread online to encourage African Americans to vote Republican [11]. This could distort public opinion and erode people’s trust in media [12, 13], necessitating the detection of AI-generated faces for their ethical use.

However, one major issue in current AI face detectors [14, 15, 16, 17] is biased detection (i.e., unfair detection performance among demographic groups [18, 19, 20, 21]). Biases can be mitigated by designing algorithmic fairness methods, but these usually require demographically annotated face datasets for model training. For example, works like [20, 21] have made efforts to enhance detection fairness based on A-FF++ [19] and A-DFD [19]. However, both datasets contain only faces from deepfake videos, so models trained on them may not be applicable for fairly detecting faces generated by GANs and DMs. Although a few datasets (e.g., GenData [22]) cover GAN and DM faces, their demographic annotations are not comprehensive. Most importantly, no existing dataset is diverse enough in generation methods to develop AI face detectors that can cope with rapidly evolving generative models. These limitations of existing datasets hamper the development of fair technologies for detecting AI-generated faces.

Moreover, benchmarking fairness provides a direct method to uncover prevalent and unique fairness issues in recent AI-generated face detection. However, there is a lack of a comprehensive benchmark to estimate the fairness of existing AI face detectors. Existing benchmarks [23, 24, 25, 26] primarily assess utility, neglecting systematic fairness evaluation. One study [18] does evaluate fairness in detection models, but their examination is only based on deepfake video datasets using a few outdated detectors. Detectors’ fairness performance on GAN faces and DM faces has not been extensively explored. The absence of a comprehensive fairness benchmark impedes a thorough understanding of the fairness behaviors of recent AI face detectors and obscures the research path for detector fairness guarantees.

In this work, we build the first million-scale demographically annotated AI-generated face image dataset: AI-Face (see Fig. 1). The face images are collected from various public datasets, including the real faces that are usually used to train AI face generators, faces from deepfake videos, and faces generated by GANs and DMs. Each face is demographically annotated with an uncertainty score on each predicted demographic attribute by our designed Contrastive Language-Image Pretraining (CLIP) [27]-based lightweight annotator. To improve the quality of annotations, we recruit three humans to correct annotations with high uncertainty scores manually. Next, we conduct the first comprehensive fairness benchmark on our dataset to estimate the fairness performance of 12 representative detectors coming from four model types. Our benchmark exposes common and unique fairness challenges in recent AI face detectors, providing essential insights that can guide and enhance the future design of fair AI face detectors. Our contributions are as follows:

  • We build the first comprehensive million-scale demographically annotated AI-generated face dataset by leveraging our developed lightweight annotator with human correction.

  • We conduct the first comprehensive fairness benchmark of AI-generated face detectors, providing an extensive fairness assessment of current representative detectors.

  • Based on our experiments and observations, we summarize the unsolved questions and offer valuable insights within this research domain, setting the stage for future investigations.

2 Background and Motivation

AI-generated Faces and Biased Detection. AI-generated face images, created by advanced AI technologies, are visually difficult to discern from real ones, see Fig. 1. They can be summarized into three categories: 1) Deepfake Videos. Initiated in 2017 [13], these use face-swapping techniques with a variational autoencoder to replace a face in a target video with one from a source [3, 4]. Note that our paper focuses solely on images extracted from videos. 2) GAN-generated Faces. Post-2017, Generative Adversarial Networks (GANs) [28] like StyleGANs [6, 7, 8] have significantly improved generated face realism. 3) DM-generated Faces. Diffusion models (DMs), emerging in 2021, generate detailed faces from textual descriptions and offer greater controllability. Tools like Midjourney [29] and DALLE2 [30] facilitate customized face generation. While these AI-generated faces can enhance visual media and creativity [10], they also pose risks, such as being misused in social media profiles [31, 32]. Therefore, numerous studies focus on detecting AI-generated faces [14, 15, 16, 17], but current detectors often show performance disparities among demographic groups like race and gender [18, 19, 20, 21]. This bias can lead to unfair targeting or exclusion, undermining trust in detection models. Recent efforts [20, 21] aim to enhance fairness in deepfake detection but mainly address deepfake videos, overlooking biases in detecting GAN and DM-generated faces.

Dataset             #Real    #Fake    #Generation Methods  Source of Real Images
A-FF++ [19]         29.8K    149.1K   5                    YouTube
A-DFD [19]          10.8K    89.6K    5                    Self-Recording
A-DFDC [19]         54.5K    52.6K    8                    Self-Recording
A-Celeb-DF-v2 [19]  26.3K    166.5K   1                    Self-Recording
A-DF-1.0 [19]       870.3K   321.5K   1                    Self-Recording
DF-1.0 [33]         2.9M     14.7M    1                    Self-Recording
DeePhy [34]         1K       50.4K    3                    YouTube
DF-Platter [35]     392.3K   653.4K   3                    YouTube
GenData [22]        -        20K      3                    CelebA [36]
Ours                866K     1.2M     37                   FFHQ [6], CASIA-WebFace [37], CelebA [36], IMDB-WIKI [38], real from FF++ [2], DFDC [39], DFD [40], Celeb-DF-v2 [41]
Check-mark columns not shown: Generation Category (Deepfake Videos, GAN, DM) and Demographic Annotation (Gender, Race, Age).
Table 1: Quantitative comparison of existing AI-generated face datasets and ours.

The Related Existing Datasets. Current AI-generated facial datasets with demographic annotations are limited in size, generation categories, methods, and annotations, as illustrated in Table 1. For instance, A-FF++, A-DFD, A-DFDC, and A-Celeb-DF-v2 [19] are deepfake video datasets with fewer than one million images. Datasets like DF-1.0 [33] and DF-Platter [35] lack comprehensive demographic annotations. Additionally, existing datasets offer limited generation methods. These limitations hinder the development of fairer AI face detectors, motivating us to build a million-scale demographically annotated AI-Face dataset.

Existing Benchmarks Category Scope of Benchmark
Deepfake Videos GAN DM Utility Fairness
DeepfakeBench [25]
Lin et al. [24]
Le et al. [26]
CDDB [23]
Loc et al. [18]
Ours
Table 2: Comparison with existing AI-generated face detection benchmarks.

Benchmark for Detecting AI-generated Faces. Benchmarks are essential for evaluating AI-generated face detectors under standardized conditions. Existing benchmarks, as shown in Table 2, mainly focus on detectors’ utility, often overlooking fairness [23, 24, 25, 26]. Only Loc et al. [18] examined detector fairness. However, their study focused only on deepfake video datasets, not on GAN- and DM-generated faces. This motivates us to conduct a comprehensive benchmark to evaluate AI face detectors’ fairness.

3 The Demographically Annotated AI-Face Dataset

To avoid the prohibitive time cost of manual annotation, we build our dataset in two phases: Annotator Development and Demographically Annotation Generation, as shown in Fig. 2.

3.1 Phase 1: Annotator Development

Problem Definition. There are existing online software (e.g., Face++ [42]) and open-source tools (e.g., InsightFace [43]) for face attribute prediction. However, they fall short of our task for two reasons: 1) They are mostly designed for face recognition and trained on datasets of real face images, so they lack the generalization capability to annotate AI-generated face images. 2) They do not provide uncertainty scores for their predictions that can be used to identify mispredicted samples for further annotation correction. We are given a training dataset $\mathbb{D}=\{(X_i,G_i,A_i,R_i)\}_{i=1}^{n}$ of size $n$, where $X_i$, $G_i\in\{Female, Male\}$, $A_i\in\{Young, Middle\text{-}aged, Senior, Others\}$, and $R_i\in\{Asian, White, Black, Others\}$ represent the $i$-th face image and its gender, age, and race labels/attributes, respectively. Our goal is to design a lightweight, generalizable annotator based on $\mathbb{D}$ to predict facial demographic attributes with uncertainty scores for each face image in our dataset.

Figure 2: Generation pipeline of our Demographically Annotated AI-Face Dataset. First, we develop a lightweight facial attribute annotator trained on the VGGFace2 dataset. Second, we collect and filter face images from Deepfake Videos, GAN-generated faces, and DM-generated faces found in public datasets. Our AI-Face annotator then predicts facial attributes with uncertainty scores for each image. Third, samples with high uncertainty are manually reviewed and corrected by three humans to improve annotation quality.

Annotator. Architecture: We utilize CLIP [27] for its strong zero-shot and few-shot learning capabilities. Leveraging CLIP’s pre-training on diverse datasets, we create a lightweight annotator for facial images. Our annotator employs a frozen pre-trained CLIP ViT L/14 [44] as a feature extractor $\mathbf{E}$ followed by a trainable 3-layer Multilayer Perceptron (MLP) as a multi-task (i.e., gender, age, and race prediction) classifier parameterized by $\theta$. Loss: For each image $X_i$, its feature $f_i$ is obtained through $f_i=\mathbf{E}(X_i)$ and is then fed into the MLP multi-task classifier with conventional classification losses for face attribute prediction. The learning objective is formulated as: $\mathcal{L}(\theta)=C(\widetilde{h}(f_i),G_i)+C(\overline{h}(f_i),A_i)+C(\widehat{h}(f_i),R_i)$, where $C(\cdot,\cdot)$ represents the (binary) cross-entropy (CE) loss, and $\widetilde{h}$, $\overline{h}$, and $\widehat{h}$ represent the classification heads for gender, age, and race, respectively. Optimization: Traditional optimization methods like stochastic gradient descent can lead to poor model generalization due to sharp loss landscapes with multiple local and global minima. To address this, we use Sharpness-Aware Minimization (SAM) [45] to enhance our annotator’s generalization by flattening the loss landscape. Specifically, flattening is attained by determining the optimal $\epsilon^{*}$ for perturbing model parameters $\theta$ to maximize the loss, formulated as: $\epsilon^{*}=\arg\max_{\|\epsilon\|_{2}\leq\gamma}\mathcal{L}(\theta+\epsilon)\approx\arg\max_{\|\epsilon\|_{2}\leq\gamma}\epsilon^{\top}\nabla_{\theta}\mathcal{L}=\gamma\,\texttt{sign}(\nabla_{\theta}\mathcal{L})$, where $\gamma$ controls the perturbation magnitude. The approximation uses a first-order Taylor expansion, assuming $\epsilon$ is small. The final equality is obtained by solving a dual norm problem, where sign represents the sign function and $\nabla_{\theta}\mathcal{L}$ is the gradient of $\mathcal{L}$ with respect to $\theta$. As a result, the model parameters are updated by solving: $\min_{\theta}\mathcal{L}(\theta+\epsilon^{*})$.
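The two-step SAM update described above can be illustrated on a toy problem. The sketch below is not the authors' training code: the quadratic `loss`, its analytic `grad`, the learning rate `lr`, and all function names are assumptions chosen for illustration. By default it follows the closed form as written above, $\epsilon^{*}=\gamma\,\texttt{sign}(\nabla_{\theta}\mathcal{L})$; the original SAM formulation [45] under an $\ell_2$ constraint instead rescales the gradient by its $\ell_2$ norm, and that variant is included as an option.

```python
import math

def loss(theta):
    # Toy stand-in for L(theta): an anisotropic quadratic bowl (assumption).
    return theta[0] ** 2 + 10.0 * theta[1] ** 2

def grad(theta):
    # Analytic gradient of the toy loss above.
    return [2.0 * theta[0], 20.0 * theta[1]]

def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)

def perturbation(g, gamma, rule="sign"):
    """Return eps* for a gradient g.

    rule="sign": eps* = gamma * sign(g), the closed form quoted in the text.
    rule="l2":   eps* = gamma * g / ||g||_2, the normalized-gradient variant.
    """
    if rule == "sign":
        return [gamma * sign(gi) for gi in g]
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return [0.0 for _ in g]
    return [gamma * gi / norm for gi in g]

def sam_step(theta, gamma=0.1, lr=0.01, rule="sign"):
    # Step 1: move to the approximate worst-case neighbor theta + eps*.
    eps = perturbation(grad(theta), gamma, rule)
    theta_adv = [t + e for t, e in zip(theta, eps)]
    # Step 2: update theta with the gradient evaluated at theta + eps*.
    g_adv = grad(theta_adv)
    return [t - lr * g for t, g in zip(theta, g_adv)]

theta0 = [1.0, -2.0]
theta1 = sam_step(theta0)
print(loss(theta0), loss(theta1))
```

In a real training loop the gradient would come from backpropagation through the MLP heads rather than from an analytic formula, and each SAM step would cost two forward-backward passes.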

Uncertainty Estimation. Although our annotator achieves high prediction performance, labels may still be mispredicted due to the ambiguity of the face images (see an example in Fig. 3). Therefore, it is crucial to provide an uncertainty score for each prediction from the annotator. To this end, inspired by [46], we incorporate dropout at each layer of the MLP for uncertainty estimation at test time. This involves performing $k$ stochastic forward passes for a given test image $X$, each with a unique dropout pattern. We thus obtain $k$ distinct softmax outputs for each demographic attribute $a$, denoted as $\{x^{1,(a)},\dots,x^{k,(a)}\}$. Then, the uncertainty score for $a$ on image $X$ (denoted as $V(X^{(a)})$) is calculated as $V(X^{(a)})=1-\Big\{\frac{1-\rho}{k}\sum_{i=1}^{k}x^{i,(a)}-\frac{\rho}{k^{2}}\sum_{i=1}^{k}\sum_{j=1}^{k}|x^{i,(a)}-x^{j,(a)}|\Big\}$, where $\rho\in[0,1]$ is a user-defined parameter that balances the measure of centrality (i.e., the first term in $\{\}$, which indicates the likelihood of the prediction being correct) against dispersion (i.e., the second term in $\{\}$, which reflects the consensus among the stochastic outputs).
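To make the uncertainty score concrete, here is a minimal sketch that evaluates $V(X^{(a)})$ from $k$ scores. It assumes each $x^{i,(a)}$ is a single scalar (the softmax probability of the predicted class in pass $i$), which the text does not spell out; the function name `uncertainty` and the default `rho=0.5` are likewise illustrative assumptions.

```python
def uncertainty(scores, rho=0.5):
    """Uncertainty V = 1 - {(1 - rho) * centrality - rho * dispersion}.

    scores: k softmax scores from k stochastic (dropout) forward passes,
            taken here as the predicted-class probability (assumption).
    rho:    weight in [0, 1] trading off centrality against dispersion.
    """
    k = len(scores)
    if k == 0:
        raise ValueError("need at least one forward pass")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    # Centrality: mean score over the k passes.
    centrality = sum(scores) / k
    # Dispersion: mean absolute difference over all k^2 ordered pairs.
    dispersion = sum(abs(xi - xj) for xi in scores for xj in scores) / k ** 2
    return 1.0 - ((1.0 - rho) * centrality - rho * dispersion)

# Same mean score, different agreement between passes (made-up values).
stable = uncertainty([0.9, 0.9, 0.9, 0.9])
noisy = uncertainty([0.8, 1.0, 0.8, 1.0])
print(stable, noisy)
```

With equal mean scores, the passes that disagree more receive the higher uncertainty, which is the behavior the dispersion term is meant to capture.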

Evaluation. To demonstrate our annotator’s effectiveness, we answer the following questions: Q1: How do the general performance and generalization capability of our annotator compare with the baselines? Q2: How does sample difficulty affect the annotator’s performance? Leveraging the good generalization capabilities of CLIP, we train our annotator on the VGGFace2 [47] dataset, which contains 9.1K individuals with 3.3M images. More importantly, [48] provides comprehensive demographic annotations for this dataset. We compare our annotator with the current state-of-the-art face attribute prediction tools Face++ [42] and InsightFace [43]. Since they do not offer predictions for the race attribute, our evaluation is confined to gender and age. The mean and standard deviation are reported based on 5 random runs. More details are in Appendix A.1.1.

For Q1, Setting: We perform intra-domain (train on VGGFace2, test on its official test set) and cross-domain (train on VGGFace2, test on four AI-generated face datasets) evaluations. Specifically, A-FF++, A-DFDC, A-DFD, and A-Celeb-DF-v2 are selected from [19] for cross-domain evaluation. Since A-DFD and A-Celeb-DF-v2 have limited age and race annotations, our evaluation of these two is confined to gender. These datasets are chosen because they closely match our objective and are not used to train Face++ and InsightFace. Results: The ‘All’ results in Table 3 demonstrate our annotator’s superiority in general performance and generalization capability against Face++ and InsightFace. Under intra-domain evaluation, it surpasses the second-best method, Face++, by 5.8% on gender and 18.9% on age. In cross-domain evaluation, our annotator maintains high accuracy on all datasets, reflecting good generalization. Remarkably, on the A-FF++ dataset, our annotator outperforms Face++ by a substantial margin of up to 11.4% and InsightFace by 16.1% on age.

Type          Dataset             Attribute  Level   InsightFace [43]   Face++ [42]        Ours
Intra-Domain  VGGFace2 [47]       Gender     All     76.7289 (0.4985)   78.0764 (0.4266)   83.8978 (0.3697)
Intra-Domain  VGGFace2 [47]       Gender     Easy    97.0133 (0.1293)   97.0863 (0.3414)   99.7333 (0.1265)
Intra-Domain  VGGFace2 [47]       Gender     Medium  74.2400 (0.8182)   75.356 (0.5938)    87.5333 (0.5007)
Intra-Domain  VGGFace2 [47]       Gender     Hard    58.9333 (0.5481)   61.787 (0.3445)    64.4267 (0.4818)
Intra-Domain  VGGFace2 [47]       Age        All     54.4311 (0.7443)   58.4889 (0.7341)   77.4044 (0.6714)
Intra-Domain  VGGFace2 [47]       Age        Easy    68.000 (0.5530)    73.0067 (0.6534)   98.4133 (0.1543)
Intra-Domain  VGGFace2 [47]       Age        Medium  49.8000 (0.6613)   53.2467 (0.8465)   78.9733 (1.0771)
Intra-Domain  VGGFace2 [47]       Age        Hard    45.4933 (1.0186)   49.2134 (0.7025)   54.8267 (0.7827)
Cross-Domain  A-FF++ [19]         Gender     All     84.9733 (0.4651)   89.1714 (0.1974)   91.3000 (0.2058)
Cross-Domain  A-FF++ [19]         Gender     Easy    96.8267 (0.3832)   98.0528 (0.1483)   98.9333 (0.0943)
Cross-Domain  A-FF++ [19]         Gender     Medium  88.0933 (0.4668)   94.3074 (0.2586)   98.8667 (0.1033)
Cross-Domain  A-FF++ [19]         Gender     Hard    70.0000 (0.5452)   75.1539 (0.1854)   76.1000 (0.4197)
Cross-Domain  A-FF++ [19]         Age        All     59.4867 (0.9291)   64.1893 (0.7609)   75.5393 (0.5130)
Cross-Domain  A-FF++ [19]         Age        Easy    71.254 (0.5973)    80.5980 (0.4140)   93.1980 (0.3110)
Cross-Domain  A-FF++ [19]         Age        Medium  58.1720 (0.4489)   63.8340 (0.6733)   81.1960 (0.4702)
Cross-Domain  A-FF++ [19]         Age        Hard    49.0340 (1.7410)   48.1360 (1.1954)   52.2240 (0.7577)
Cross-Domain  A-DFDC [19]         Gender     All     70.1111 (0.5037)   76.0917 (0.6290)   78.2922 (0.5178)
Cross-Domain  A-DFDC [19]         Gender     Easy    85.8533 (0.5239)   92.1414 (0.5447)   96.2933 (0.5927)
Cross-Domain  A-DFDC [19]         Gender     Medium  68.4666 (0.5667)   73.5088 (0.4910)   76.2005 (0.5028)
Cross-Domain  A-DFDC [19]         Gender     Hard    56.0133 (0.5014)   62.6249 (0.8513)   62.3334 (0.4580)
Cross-Domain  A-DFDC [19]         Age        All     66.6967 (0.8015)   69.5907 (0.5687)   77.1800 (0.6300)
Cross-Domain  A-DFDC [19]         Age        Easy    72.1580 (0.6785)   84.5000 (0.4908)   95.3820 (0.3247)
Cross-Domain  A-DFDC [19]         Age        Medium  64.398 (0.9182)    64.238 (0.5423)    73.1620 (0.8592)
Cross-Domain  A-DFDC [19]         Age        Hard    63.5340 (0.8078)   60.034 (0.6730)    62.9960 (0.7061)
Cross-Domain  A-DFD [19]          Gender     All     66.7156 (0.7681)   70.7983 (1.2229)   74.9822 (0.6029)
Cross-Domain  A-DFD [19]          Gender     Easy    85.5467 (0.6791)   88.9791 (0.8297)   94.0400 (0.2999)
Cross-Domain  A-DFD [19]          Gender     Medium  60.5467 (0.9017)   64.2144 (1.3436)   70.0267 (0.9471)
Cross-Domain  A-DFD [19]          Gender     Hard    54.0533 (0.7235)   59.2015 (1.4953)   60.8800 (0.5616)
Cross-Domain  A-Celeb-DF-v2 [19]  Gender     All     91.9244 (0.3003)   90.8100 (0.4487)   95.1489 (0.4088)
Cross-Domain  A-Celeb-DF-v2 [19]  Gender     Easy    98.9733 (0.1769)   98.1867 (0.2286)   99.9867 (0.0267)
Cross-Domain  A-Celeb-DF-v2 [19]  Gender     Medium  94.0000 (0.3239)   94.4933 (0.5052)   99.7600 (0.0998)
Cross-Domain  A-Celeb-DF-v2 [19]  Gender     Hard    82.8000 (0.4000)   79.7500 (0.6124)   85.7000 (1.1000)
Table 3: Comparing our annotator against Face++ [42] and InsightFace [43] under intra-domain and cross-domain evaluations (Accuracy (%)) on different levels of sample difficulty. The prediction mean and standard deviation (in parentheses) are reported. The best results are shown in bold. More results are in Appendix A.1.3.

For Q2, Setting: We also design a stratified evaluation method by separating each test dataset into three subsets (Easy, Medium, and Hard) based on the estimated uncertainty scores. Specifically, for each demographic attribute $a$, we define two thresholds $t_1^a$ and $t_2^a$, where $t_1^a<t_2^a$ (more details are in Appendix A.1.2). Then, we have $Easy=\{X\mid V(X^{(a)})<t_1^a\}$, $Medium=\{X\mid t_1^a\leq V(X^{(a)})\leq t_2^a\}$, and $Hard=\{X\mid V(X^{(a)})>t_2^a\}$. Next, we sample 1,500 images from each subset. This stratification is crucial for a thorough examination of the model’s performance across a broad spectrum of data challenges. To avoid attribute-specific biases, each subset is balanced with respect to attribute. Results: Table 3 illustrates that while all methods show decreased accuracy as the sample difficulty level increases, our annotator demonstrates greater resilience. For example, under intra-domain evaluation, our annotator’s gender performance drops by 10.2% from easy to medium difficulty, compared to Face++’s 21.7% drop. In the cross-domain scenario, our annotator experiences a 14.3% reduction on gender in A-Celeb-DF-v2 [19] from easy to hard, versus InsightFace’s 16.2%.
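The Easy/Medium/Hard split above can be sketched in a few lines. The threshold values, image identifiers, and function names below are hypothetical (the actual thresholds are given in Appendix A.1.2), and the attribute balancing applied before sampling is omitted for brevity.

```python
import random

def difficulty_level(v, t1, t2):
    """Map an uncertainty score v to Easy / Medium / Hard, given t1 < t2."""
    if not t1 < t2:
        raise ValueError("thresholds must satisfy t1 < t2")
    if v < t1:
        return "Easy"
    if v > t2:
        return "Hard"
    return "Medium"  # t1 <= v <= t2, both boundaries included

def stratify(scores, t1, t2):
    """Group image ids by difficulty; scores maps image id -> uncertainty."""
    subsets = {"Easy": [], "Medium": [], "Hard": []}
    for image_id, v in scores.items():
        subsets[difficulty_level(v, t1, t2)].append(image_id)
    return subsets

def sample_subset(image_ids, n, seed=0):
    """Draw up to n ids without replacement (attribute balancing omitted)."""
    rng = random.Random(seed)
    return rng.sample(image_ids, min(n, len(image_ids)))

# Hypothetical scores and thresholds, for illustration only.
scores = {"img_a": 0.05, "img_b": 0.30, "img_c": 0.45, "img_d": 0.60, "img_e": 0.92}
subsets = stratify(scores, t1=0.30, t2=0.60)
print(subsets)
```

Scores that fall exactly on a threshold land in Medium, matching the non-strict inequalities in the definition of the Medium subset.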

3.2 Phase 2: Demographically Annotation Generation

Data Collection. We build our AI-Face dataset by collecting and integrating public AI-generated face images sourced from academic publications, GitHub repositories, and commercial tools. More details are in Appendix A.2.1. Specifically, the fake face images in our dataset originate from 4 Deepfake Video datasets (i.e., A-FF++ [19], A-DFDC [19], A-DFD [19], and A-Celeb-DF-v2 [19]), 10 GAN models (i.e., AttGAN [49], MMDGAN [50], StarGAN [49], StyleGANs [49, 51, 52], MSGGAN [50], ProGAN [53], STGAN [50], and VQGAN [54]), and 8 DMs (i.e., DALLE2 [55], IF [55], Midjourney [55], DCFace [56], Latent Diffusion [57], Palette [58], Stable Diffusion v1.5 [59], and Stable Diffusion Inpainting [59]). This constitutes a total of 1,245,660 fake face images in our dataset. These fake images are correspondingly generated from 8 real source datasets (i.e., FFHQ [6], CASIA-WebFace [37], IMDB-WIKI [38], CelebA [36], and real images from FF++ [2], DFDC [39], DFD [40], and Celeb-DF-v2 [41]). This constitutes a total of 866,096 real face images in our dataset. In general, our dataset contains 30 subsets and 37 generation methods (i.e., 5 in A-FF++, 5 in A-DFD, 8 in A-DFDC, 1 in A-Celeb-DF-v2, 10 GANs, and 8 DMs). We use RetinaFace [60] to detect and crop faces so that each image contains only one face.

Annotator Prediction. For our collected images, annotations are generated iteratively: the annotator from Phase 1 attaches an uncertainty score to each of its predictions, as shown in Fig. 2.

Human Correction. As described in ‘Uncertainty Estimation’ in Section 3.1, the annotator may mispredict ambiguous face images, necessitating human review and correction. To this end, we propose two annotation correction strategies: 1) For subsets that have the same images and demographic attribute classes as those in existing datasets, such as A-FF++ [19] and A-DFDC [19], we filter out images that may need human correction based on annotation inconsistency.

Figure 3: Uncertainty-based strategy for identifying ambiguous faces for human correction.

2) For the rest of the subsets, we identify the most ambiguous images that need human correction based on uncertainty scores. Specifically, for demographic attribute $a_j$ on subset $j$, we define a specific threshold $t^{a_j}$ (more details are in Appendix A.2.2). If $V(X^{(a_j)})>t^{a_j}$, the annotation for attribute $a_j$ of the image $X$ will undergo a verification process, potentially requiring human re-annotation (see Fig. 3). In practice, we recruit three humans to correct the filtered images, consolidating their evaluations with a majority vote to finalize annotations.
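The uncertainty-based correction strategy can be sketched as a threshold check followed by a majority vote. This is a hedged illustration rather than the authors' pipeline: the function names and example values are hypothetical, and keeping the annotator's label when no label wins a strict majority is an assumption, since the text does not say how such ties are resolved.

```python
from collections import Counter

def needs_review(v, threshold):
    """Flag an annotation whose uncertainty score exceeds the threshold."""
    return v > threshold

def majority_vote(labels):
    """Return the label chosen by more than half of the reviewers, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

def correct_annotation(predicted, v, threshold, human_labels):
    """Keep confident predictions; otherwise defer to the reviewers' majority."""
    if not needs_review(v, threshold):
        return predicted
    voted = majority_vote(human_labels)
    # Assumption: fall back to the annotator's label when reviewers all disagree.
    return predicted if voted is None else voted

# Hypothetical example: an ambiguous age prediction reviewed by three people.
final = correct_annotation("Young", v=0.8, threshold=0.5,
                           human_labels=["Senior", "Senior", "Young"])
print(final)
```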

Evaluation. To estimate our dataset’s quality, we answer the following questions: Q1: Can we directly incorporate the existing annotations into our dataset? Q2: How effective is human correction? Q3: What is the overall annotation quality of our dataset?

Type    Datasets                    Attribute  ACC(%)    Precision(%)  Recall(%)
For Q1  A-FF++ [19]                 Gender     8.0163    17.3354       5.8314
For Q1  A-FF++ [19]                 Age        19.9002   30.6658       29.6071
For Q1  A-FF++ [19]                 Race       28.7865   35.7122       41.1687
For Q1  Ours-FF++ (w/o Correction)  Gender     91.9837   82.6646       94.1684
For Q1  Ours-FF++ (w/o Correction)  Age        21.1830   32.1232       45.7231
For Q1  Ours-FF++ (w/o Correction)  Race       45.9775   50.3803       40.1949
For Q1  A-DFDC [19]                 Gender     20.2252   27.5332       21.6538
For Q1  A-DFDC [19]                 Age        16.7493   29.0640       29.5519
For Q1  A-DFDC [19]                 Race       18.1115   15.1092       22.0637
For Q1  Ours-DFDC (w/o Correction)  Gender     79.7748   72.4668       78.3462
For Q1  Ours-DFDC (w/o Correction)  Age        45.9748   49.4734       48.7861
For Q1  Ours-DFDC (w/o Correction)  Race       70.9001   64.7655       65.1608
For Q2  Ours (w/o Correction)       Gender     83.4167   83.4167       83.4242
For Q2  Ours (w/o Correction)       Age        43.8333   43.8333       54.1792
For Q2  Ours (w/o Correction)       Race       67.4167   65.0718       59.2350
For Q2  Ours                        Gender     84.8333   84.8738       84.8599
For Q2  Ours                        Age        44.7500   44.0937       54.6033
For Q2  Ours                        Race       68.8333   66.6440       61.3225
For Q3  Ours                        Gender     98.6667   98.6688       98.6667
For Q3  Ours                        Age        56.2500   50.1748       53.0514
For Q3  Ours                        Race       86.2500   75.5216       67.4076
Table 4: Evaluation results of our dataset annotation quality for questions Q1, Q2, and Q3. ‘Ours-FF++ (w/o Correction),’ ‘Ours-DFDC (w/o Correction),’ and ‘Ours (w/o Correction)’ represent our predicted annotations on A-FF++, A-DFDC, and our entire dataset without human correction, respectively. ACC represents Accuracy.

For Q1, Setting: We compare our dataset’s annotation quality before human correction on A-FF++ (i.e., Ours-FF++ (w/o Correction)) and A-DFDC (i.e., Ours-DFDC (w/o Correction)) against their existing annotations from [19]. We regard human re-labeled annotations as the ground truth. Results: The results in Table 4 ‘For Q1’ show the superior annotation accuracy of our dataset. For example, Ours-FF++ (w/o Correction) surpasses A-FF++ by 83.97% in gender accuracy, and Ours-DFDC (w/o Correction) exceeds A-DFDC by 59.55%. The large performance gap indicates that the images identified by annotation inconsistency are mislabeled in A-FF++ [19] and A-DFDC [19], so their annotations cannot be directly merged into our dataset. Some examples are shown in Appendix A.2.3.

For Q2, Setting: We consider two dataset versions: 1) Ours (w/o Correction), where annotations are not corrected by humans. 2) Ours, where annotations are corrected by humans. With the help of the uncertainty score, we sample 1,200 attribute-balanced images (400 easy, 400 medium, and 400 hard) from the whole dataset to ensure a fair evaluation. Three humans re-annotated these images to establish ground truth. Results: Table 4 ‘For Q2’ shows that human correction improves performance across all attributes, increasing accuracy by 1.42 percentage points for gender, 0.92 for age, and 1.42 for race, validating the effectiveness of our correction strategy. See Appendix A.2.4 for more results.

For Q3, Setting: We randomly sample 1,200 images from the whole dataset. Three humans re-annotated these images to create ground truth. Results: As shown in Table 4 ‘For Q3’, Ours reflects the approximate overall annotation quality of our dataset. Notably, the gender and race annotations show high correctness (e.g., 98.6667% ACC on gender and 86.2500% ACC on race). However, the age annotation shows lower accuracy, since age groups are challenging to differentiate.
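As a rough illustration of how the ACC, Precision, and Recall columns of Table 4 could be computed against human ground truth, here is a minimal sketch. The macro-averaging over classes, the function name `annotation_quality`, and the toy labels are assumptions made for illustration; this section does not state which averaging scheme was used.

```python
from collections import Counter


def annotation_quality(predicted, ground_truth):
    """Return (accuracy, macro precision, macro recall) in percent.

    Macro-averaging over classes is an assumption made for this sketch.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")

    classes = sorted(set(predicted) | set(ground_truth))
    pred_count = Counter(predicted)      # how often each label was predicted
    true_count = Counter(ground_truth)   # how often each label is the truth
    true_pos = Counter(p for p, t in zip(predicted, ground_truth) if p == t)

    accuracy = sum(true_pos.values()) / len(predicted)
    # A class that is never predicted (or never present) contributes 0.
    precision = sum(true_pos[c] / pred_count[c] if pred_count[c] else 0.0
                    for c in classes) / len(classes)
    recall = sum(true_pos[c] / true_count[c] if true_count[c] else 0.0
                 for c in classes) / len(classes)
    return 100 * accuracy, 100 * precision, 100 * recall


# Hypothetical toy labels, not taken from the dataset.
pred = ["male", "male", "female", "female"]
truth = ["male", "female", "female", "female"]
acc, prec, rec = annotation_quality(pred, truth)
```

In this sketch, `truth` plays the role of the human re-annotations and `pred` the annotations being evaluated.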

4 Fairness Benchmark Experiments

In this section, we evaluate the fairness of existing AI-generated image detectors alongside their utility on our AI-Face dataset (80%/20% for Train/Test). Our goal is to show the significance of our dataset and expose the fairness issues of recent detectors in combating AI-generated faces.

Detection Methods. Our benchmark implements 12 detectors, as detailed in Appendix B.1. The methods cover approaches tailored to detecting AI-generated faces from deepfake videos, GANs, and DMs. They can be classified into four types. Naive detectors are backbone models that can be used directly as detectors for binary classification, including CNN-based (i.e., Xception [61] and EfficientB4 [62]) and transformer-based (i.e., ViT-B/16 [63]) models. Frequency-based detectors explore the frequency domain for forgery detection (i.e., F3Net [64], SPSL [65], and SRM [66]). Spatial-based detectors focus on mining spatial characteristics (e.g., texture) within images (i.e., UCF [16], UnivFD [67], and CORE [68]). Fairness-enhanced detectors focus on improving fairness in AI-generated face detection through specifically designed algorithms (i.e., DAW-FDD [20], DAG-FDD [20], and PG-FDD [21]). See Appendix B.2 for implementation and training details.

| Measure | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| Fairness↓ | Gender | $F_{MEO}$(%) | 0.387 | 1.176 | 0.187 | 0.279 | 0.454 | 0.533 | 0.305 | 0.458 | 1.635 | 0.404 | 0.272 | 0.236 |
| Fairness↓ | Gender | $F_{DP}$(%) | 2.843 | 2.052 | 2.489 | 2.941 | 2.998 | 2.433 | 2.890 | 2.456 | 1.977 | 2.979 | 2.799 | 2.614 |
| Fairness↓ | Gender | $F_{OAE}$(%) | 0.271 | 0.595 | 0.422 | 0.086 | 0.188 | 0.268 | 0.169 | 0.557 | 0.977 | 0.123 | 0.192 | 0.134 |
| Fairness↓ | Gender | $F_{EO}$(%) | 0.439 | 1.229 | 0.235 | 0.552 | 0.577 | 0.536 | 0.346 | 0.490 | 1.846 | 0.699 | 0.407 | 0.237 |
| Fairness↓ | Race | $F_{MEO}$(%) | 4.386 | 8.307 | 13.078 | 3.098 | 4.736 | 5.470 | 3.188 | 14.663 | 16.001 | 3.461 | 3.344 | 1.956 |
| Fairness↓ | Race | $F_{DP}$(%) | 18.248 | 19.691 | 18.446 | 18.282 | 18.822 | 16.182 | 18.770 | 23.542 | 24.163 | 18.306 | 18.288 | 18.040 |
| Fairness↓ | Race | $F_{OAE}$(%) | 3.509 | 5.659 | 5.351 | 2.217 | 2.201 | 4.044 | 1.847 | 6.505 | 5.105 | 1.365 | 1.847 | 1.132 |
| Fairness↓ | Race | $F_{EO}$(%) | 10.863 | 19.921 | 24.002 | 7.052 | 7.282 | 11.602 | 6.311 | 30.947 | 24.015 | 6.948 | 6.439 | 4.039 |
| Fairness↓ | Age | $F_{MEO}$(%) | 1.695 | 3.028 | 8.931 | 1.319 | 1.025 | 1.090 | 0.854 | 5.818 | 6.964 | 2.838 | 0.809 | 0.781 |
| Fairness↓ | Age | $F_{DP}$(%) | 6.242 | 6.724 | 6.264 | 6.357 | 6.340 | 5.905 | 6.257 | 6.260 | 5.030 | 6.249 | 6.140 | 6.098 |
| Fairness↓ | Age | $F_{OAE}$(%) | 1.028 | 1.619 | 3.948 | 1.017 | 0.710 | 0.934 | 0.635 | 4.966 | 3.652 | 2.610 | 0.606 | 0.506 |
| Fairness↓ | Age | $F_{EO}$(%) | 4.116 | 6.080 | 12.888 | 3.696 | 2.827 | 3.116 | 2.479 | 15.252 | 8.382 | 7.361 | 2.171 | 1.587 |
| Fairness↓ | Intersection | $F_{MEO}$(%) | 7.113 | 9.999 | 14.667 | 4.739 | 7.320 | 9.731 | 4.606 | 17.606 | 19.303 | 5.316 | 4.708 | 2.604 |
| Fairness↓ | Intersection | $F_{DP}$(%) | 20.675 | 20.963 | 20.114 | 20.492 | 20.242 | 19.112 | 20.704 | 24.366 | 25.892 | 20.373 | 19.940 | 20.402 |
| Fairness↓ | Intersection | $F_{OAE}$(%) | 6.174 | 9.181 | 7.711 | 3.692 | 3.744 | 6.498 | 3.061 | 11.802 | 6.035 | 2.641 | 3.174 | 1.830 |
| Fairness↓ | Intersection | $F_{EO}$(%) | 24.520 | 42.330 | 49.075 | 16.699 | 16.257 | 25.983 | 13.932 | 68.449 | 47.016 | 14.539 | 14.118 | 8.618 |
| Fairness↓ | Individual | $F_{IND}$(%) | 112.067 | 585.935 | 0.125 | 46.083 | 22.982 | 1.383 | 3.246 | 8.606 | 0.598 | 28.437 | 13.706 | 0.477 |
| Fairness↓ | | Avg-$F_{R}$ | 6.824 | 9.529 | 7.706 | 5.941 | 6.647 | 5.471 | 4.353 | 9.941 | 9.235 | 6.118 | 3.765 | 2.059 |
| Fairness↓ | | Avg-$F_{MR}$ | 8.020 (Naive) | | | 6.020 (Frequency) | | | 7.843 (Spatial) | | | 3.981 (Fairness-enhanced) | | |
| Utility↑ | - | ACC(%) | 97.639 | 95.404 | 93.719 | 98.229 | 98.274 | 97.978 | 98.635 | 90.229 | 96.087 | 97.316 | 98.543 | 99.079 |
| Utility↑ | - | AUC(%) | 99.768 | 99.117 | 98.914 | 99.826 | 99.786 | 99.767 | 99.885 | 96.030 | 98.846 | 99.703 | 99.871 | 99.937 |
| Utility↑ | - | AP(%) | 99.846 | 99.359 | 99.240 | 99.885 | 99.853 | 99.829 | 99.917 | 96.973 | 98.987 | 99.802 | 99.916 | 99.956 |
| Utility↑ | - | EER(%) | 2.388 | 4.794 | 5.829 | 1.741 | 1.610 | 2.134 | 1.365 | 10.680 | 4.656 | 2.701 | 1.365 | 1.212 |
| | | Training Time / Epoch | 1h35min | 3h07min | 3h26min | 1h41min | 1h37min | 4h05min | 5h10min | 5h07min | 1h36min | 1h45min | 1h38min | 7h45min |
Table 5: Overall performance. Top 3 values on each metric are highlighted in green, blue, and yellow.

Evaluation Metrics. To provide comprehensive benchmarking, we consider 5 fairness metrics commonly used in the fairness community [69, 70, 71, 72, 73] and 4 widely used utility metrics. For fairness metrics, we consider Demographic Parity ($F_{DP}$) [69, 70], Max Equalized Odds ($F_{MEO}$) [72], Equal Odds ($F_{EO}$) [71], and Overall Accuracy Equality ($F_{OAE}$) [72] for evaluating group (e.g., gender) and intersectional (e.g., individuals of a specific race and simultaneously a specific gender) fairness. We also use individual fairness ($F_{IND}$) [73, 74] (i.e., similar individuals should have similar predicted outcomes). Definitions of the fairness metrics can be found in Appendix B.3. To compare detectors’ performance clearly and fairly, we define the Average Fairness Rank (Avg-$F_{R}$), which ranks each detector on each fairness metric and averages these ranks. We also define Avg-$F_{MR}$ as the average of Avg-$F_{R}$ across the methods within a model type. For utility metrics, we employ Accuracy (ACC), the Area Under the ROC Curve (AUC), Average Precision (AP), and Equal Error Rate (EER).
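To make the group fairness metrics more concrete, the following minimal sketch computes three of them as max-minus-min gaps between subgroup rates. These formulations are common ones from the fairness literature and are assumptions here; the exact definitions are given in Appendix B.3 and may differ in detail. $F_{EO}$ and $F_{IND}$ are omitted because their aggregation is not described in this section. The function name and the toy data are hypothetical.

```python
def _rate(flags):
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def group_fairness_gaps(y_true, y_pred, groups):
    """Return (F_DP, F_OAE, F_MEO) in percent as max-minus-min gaps over groups.

    y_true, y_pred: 0 = real, 1 = fake. groups: subgroup label per sample.
    Assumed formulations, not necessarily those of Appendix B.3.
    """
    names = sorted(set(groups))
    by_group = {g: [(t, p) for t, p, s in zip(y_true, y_pred, groups) if s == g]
                for g in names}

    def gap(rate_fn):
        rates = [rate_fn(by_group[g]) for g in names]
        return max(rates) - min(rates)

    # Demographic parity: rate of "fake" predictions per group.
    f_dp = gap(lambda rows: _rate([p == 1 for _, p in rows]))
    # Overall accuracy equality: accuracy per group.
    f_oae = gap(lambda rows: _rate([t == p for t, p in rows]))
    # Max equalized odds: worst gap in P(pred = 1 | true = y) over y in {0, 1}.
    f_meo = max(
        gap(lambda rows, y=y: _rate([p == 1 for t, p in rows if t == y]))
        for y in (0, 1)
    )
    return 100 * f_dp, 100 * f_oae, 100 * f_meo


# Hypothetical toy data: two subgroups, four samples each.
groups = ["A"] * 4 + ["B"] * 4
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
f_dp, f_oae, f_meo = group_fairness_gaps(y_true, y_pred, groups)
```

For intersectional fairness, the same function could be applied with `groups` set to combined labels such as a gender-race pair.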

Results. Overall Performance. Table 5 reports the overall performance on our AI-Face test set. Our observations are: 1) Most detectors exhibit unfairness, except for Fairness-enhanced detectors, which show relatively lower performance disparities. 2) The top 3 performing methods are PG-FDD [21], DAG-FDD [20], and UCF [16] according to Avg-$F_{R}$. 3) According to Avg-$F_{MR}$, Fairness-enhanced detectors demonstrate superior performance. Frequency detectors surpass both Spatial and Naive detectors. A possible reason is that frequency features focus more on forgery traces while weakening demographic features. This highlights a potential avenue for future research: enhancing detector fairness by integrating frequency features with fairness-enhanced algorithms. 4) 9 out of 12 detectors have an AUC higher than 99%, demonstrating that training on our AI-Face dataset yields detectors with high utility. 5) PG-FDD demonstrates superior performance but has a long training time, which can be explored and addressed in the future.
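The ranking step can be sketched as follows. The tie-handling rule (tied detectors share the smallest rank) and the function name are assumptions, since the text does not specify them. The example ranks only three detectors on two gender metrics from Table 5, so its output does not reproduce the Avg-$F_{R}$ row, which ranks all 12 detectors on all fairness metrics. The last lines recompute the Naive Avg-$F_{MR}$ entry from the three Naive Avg-$F_{R}$ values in Table 5.

```python
def average_fairness_rank(scores):
    """scores: {metric: {detector: value}}, lower value = fairer.

    Returns {detector: mean rank}, with rank 1 for the lowest value.
    Ties share the smallest rank (an assumption for this sketch).
    """
    detectors = sorted(next(iter(scores.values())))
    totals = {d: 0 for d in detectors}
    for values in scores.values():
        for d in detectors:
            # Rank = 1 + number of detectors with a strictly lower value.
            totals[d] += 1 + sum(values[o] < values[d] for o in detectors)
    return {d: totals[d] / len(scores) for d in detectors}


# Gender F_MEO and F_EO for three detectors, copied from Table 5.
subset = {
    "F_MEO": {"Xception": 0.387, "ViT-B/16": 0.187, "PG-FDD": 0.236},
    "F_EO": {"Xception": 0.439, "ViT-B/16": 0.235, "PG-FDD": 0.237},
}
ranks = average_fairness_rank(subset)

# Avg-F_MR: mean of Avg-F_R within a model type (Naive detectors, Table 5).
naive_avg_fr = [6.824, 9.529, 7.706]
avg_fmr_naive = round(sum(naive_avg_fr) / len(naive_avg_fr), 3)
```

Within this three-detector subset, ViT-B/16 ranks first on both metrics, and the recomputed Naive average matches the 8.020 reported in Table 5.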

Performance on Different Subsets. Fig. 4 shows the intersectional $F_{EO}$ and AUC of detectors on each test subset (e.g., subsets originating from different generative methods). We observe that, for every detector, fairness varies considerably across generative methods. The largest bias on most detectors comes from detecting face images generated by STGAN [75] and Commercial Tools (CT), including DALLE2 [55], IF [55], and Midjourney [55]. Moreover, the stable utility demonstrates our dataset’s expansiveness and diversity, enabling effective training to detect AI-generated faces from various generative methods. Full evaluation results are in Appendix B.4.

Performance on Different Subgroups. We analyze all detectors on intersectional subgroups: Male-White (M-W), Male-Black (M-B), Male-Asian (M-A), Male-Others (M-O), Female-White (F-W), Female-Black (F-B), Female-Asian (F-A), Female-Others (F-O). Fig. 5 plots the ratio of each subgroup’s FPR to that of a reference group (M-W). 1) It is clear that facial images of M-A, F-B, and F-A are more likely to be mistakenly detected as fake than facial images of M-W. 2) However, the FPR of M-W is higher than that of the other subgroups in DAW-FDD. This highlights a challenge in algorithmic fairness methods: improving performance for minority groups can inadvertently raise the error rate for the majority group (e.g., M-W). See the demographic distribution in Appendix A.2.1.
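The FPR ratios described here can be sketched as below. The function name, the toy predictions, and the reading of the 50% margin as a ratio threshold of 1.5 are assumptions for illustration only.

```python
def fpr_ratios(y_true, y_pred, subgroups, reference="M-W"):
    """Return {subgroup: FPR / FPR of the reference subgroup}.

    FPR = fraction of real faces (label 0) predicted as fake (label 1).
    """
    counts = {}  # subgroup -> [false positives, number of real samples]
    for t, p, g in zip(y_true, y_pred, subgroups):
        if t == 0:  # only real faces enter the FPR
            entry = counts.setdefault(g, [0, 0])
            entry[0] += int(p == 1)
            entry[1] += 1
    fpr = {g: fp / n for g, (fp, n) in counts.items()}
    if fpr.get(reference, 0.0) == 0.0:
        raise ValueError("reference subgroup has zero FPR or no real samples")
    return {g: rate / fpr[reference] for g, rate in fpr.items()}


# Hypothetical toy predictions; the last sample is fake and is ignored.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 1]
subgroups = ["M-W"] * 4 + ["F-B"] * 5
ratios = fpr_ratios(y_true, y_pred, subgroups)

# Subgroups whose FPR exceeds the reference by more than 50% (assumed margin).
flagged = sorted(g for g, r in ratios.items() if r > 1.5)
```

A ratio above 1 means real faces from that subgroup are flagged as fake more often than real faces from the reference subgroup.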

Figure 4: Visualization of the intersectional $F_{EO}$ (%) and AUC (%) of detectors on different subsets. A smaller $F_{EO}$ polygon area represents better fairness. A larger AUC area means better utility.
Figure 5: Ratios of FPR for each subgroup to the reference subgroup Male-White (M-W). The blue line marks the margin 50% above the reference. Red and green bars indicate ratios above and below this margin, respectively.

Fairness Robustness Evaluation. Images spread on public platforms usually undergo post-processing. Therefore, it is important to evaluate whether detectors preserve fairness when handling distorted images. We apply 6 post-processing methods to the test images: Random Crop (RC) [76], Rotation (RT) [25], Brightness Contrast (BC) [25], Hue Saturation Value (HSV) [25], Gaussian Blur (GB) [25], and JPEG Compression (JC) [77] (see Appendix B.5 for more details). Fig. 6 shows each detector’s intersectional $F_{EO}$ and AUC performance changes after post-processing. Our observations are: 1) These impairments tend to wash out forensic traces, to the point that detectors suffer significant performance degradation. 2) Recent Fairness-enhanced detectors struggle to maintain fairness when images undergo post-processing. 3) Transformer-based models (i.e., ViT-B/16 [63] and UnivFD [67]) demonstrate stronger robustness than CNN-based models. 4) JPEG Compression and Gaussian Blur cause notably greater performance degradation than the other methods. See Appendix B.6 for more robustness analysis with respect to different degrees of post-processing.
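To give a feel for what two of these distortions do to pixel values, here is a dependency-free sketch of Brightness Contrast and Gaussian Blur on a grayscale image stored as nested lists. It is an illustration only: the parameter values, function names, and edge handling are assumptions, and the actual post-processing settings are those described in Appendix B.5.

```python
import math


def brightness_contrast(img, alpha=1.2, beta=10.0):
    """Scale contrast by alpha, shift brightness by beta, clip to [0, 255]."""
    return [[min(255.0, max(0.0, alpha * v + beta)) for v in row] for row in img]


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [
        [sum(k * row[min(width - 1, max(0, x + i - radius))]
             for i, k in enumerate(kernel))
         for x in range(width)]
        for row in img
    ]


def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur: horizontal pass, then vertical pass."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = _blur_rows(img, kernel)
    transposed = [list(col) for col in zip(*horizontal)]
    vertical = _blur_rows(transposed, kernel)
    return [list(row) for row in zip(*vertical)]


# Hypothetical 4x5 grayscale image with a single bright pixel.
image = [[0.0] * 5 for _ in range(4)]
image[1][2] = 255.0
blurred = gaussian_blur(image)
adjusted = brightness_contrast([[0.0, 100.0, 250.0]])
```

Blurring spreads the bright pixel over its neighbors, which is the kind of smoothing that can weaken fine forensic traces. Random crop, rotation, HSV shifts, and JPEG compression are usually applied with an image library and are not sketched here.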

Figure 6: Visualization of performance changes after post-processing. A shorter bar represents better robustness.
| Model Type | Detector | A-DF-1.0 [19] Fairness $F_{OAE}$(%)↓ | A-DF-1.0 [19] Fairness $F_{EO}$(%)↓ | A-DF-1.0 [19] Utility AUC(%)↑ | DF-Platter [35] Fairness $F_{OAE}$(%)↓ | DF-Platter [35] Fairness $F_{EO}$(%)↓ | DF-Platter [35] Utility AUC(%)↑ | GenData [22] Fairness $F_{OAE}$(%)↓ | GenData [22] Fairness $F_{EO}$(%)↓ | GenData [22] Utility AUC(%)↑ | Avg-$F_{R}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Naive | Xception [61] | 4.227(+3.956) | 9.198(+8.759) | 82.479(-17.289) | 2.308(+2.037) | 8.691(+8.252) | 75.933(-23.835) | 0.438(+0.167) | 1.724(+1.285) | 94.315(-5.453) | 5.167 |
| Naive | EfficientB4 [62] | 3.689(+3.094) | 17.017(+15.788) | 61.436(-37.681) | 4.459(+3.864) | 10.191(+8.962) | 63.871(-35.246) | 0.001(-0.594) | 3.621(+2.392) | 87.522(-11.595) | 8.000 |
| Naive | ViT-B/16 [63] | 4.45(+4.028) | 9.154(+8.919) | 70.896(-28.018) | 2.531(+2.109) | 5.557(+5.322) | 68.935(-29.979) | 1.249(+0.827) | 2.874(+2.639) | 89.109(-9.805) | 6.667 |
| Frequency | F3Net [64] | 1.749(+1.663) | 19.484(+18.932) | 86.265(-13.561) | 2.995(+2.909) | 5.445(+4.893) | 82.421(-17.405) | 0.155(+0.069) | 2.927(+2.375) | 93.882(-5.944) | 6.000 |
| Frequency | SPSL [65] | 8.497(+8.309) | 2.430(+1.853) | 75.177(-24.609) | 3.323(+3.135) | 8.966(+8.389) | 82.024(-17.762) | 0.138(-0.050) | 2.321(+1.744) | 94.320(-5.466) | 6.167 |
| Frequency | SRM [66] | 3.708(+3.440) | 1.169(+0.633) | 65.779(-33.988) | 4.976(+4.708) | 33.702(+33.166) | 72.777(-26.990) | 1.545(+1.277) | 2.378(+1.842) | 94.130(-5.637) | 8.000 |
| Spatial | UCF [16] | 2.930(+2.761) | 9.924(+9.578) | 83.260(-16.625) | 3.536(+3.367) | 9.395(+9.049) | 83.92(-15.965) | 1.346(+1.177) | 1.377(+1.031) | 94.948(-4.937) | 6.500 |
| Spatial | UnivFD [67] | 14.149(+13.592) | 1.833(+1.343) | 65.810(-30.220) | 7.686(+7.129) | 11.701(+11.211) | 69.483(-26.547) | 0.903(+0.346) | 2.227(+1.737) | 85.965(-10.065) | 8.167 |
| Spatial | CORE [68] | 0.308(-0.669) | 11.854(+10.008) | 79.222(-19.624) | 3.966(+2.989) | 5.267(+3.421) | 81.264(-17.582) | 0.005(-0.972) | 2.943(+1.097) | 94.329(-4.517) | 5.667 |
| Fairness-enhanced | DAW-FDD [20] | 5.040(+4.917) | 4.993(+4.294) | 80.308(-19.395) | 2.577(+2.454) | 7.253(+6.554) | 78.562(-21.141) | 0.205(+0.082) | 2.708(+2.009) | 93.876(-5.827) | 6.000 |
| Fairness-enhanced | DAG-FDD [20] | 4.279(+4.087) | 13.565(+13.158) | 85.859(-14.012) | 3.885(+3.693) | 7.350(+6.943) | 83.153(-16.718) | 1.062(+0.870) | 1.688(+1.281) | 94.326(-5.545) | 7.167 |
| Fairness-enhanced | PG-FDD [21] | 4.263(+4.129) | 11.077(+10.840) | 81.174(-18.763) | 1.984(+1.850) | 4.715(+4.478) | 84.572(-15.365) | 1.205(+1.071) | 1.159(+0.922) | 94.962(-4.975) | 4.500 |
Table 6: Fairness generalization results based on the gender attribute. The smallest performance changes (in parentheses) and the best performance are in bold and in red, respectively.

Fairness Generalization Evaluation. To evaluate detectors’ fairness generalization capability, we train them on AI-Face and test them on A-DF-1.0, DF-Platter, and GenData, none of which are part of AI-Face. Results on the gender attribute in Table 6 show that: 1) According to Avg-$F_{R}$, the top three methods in fairness preservation are PG-FDD, Xception, and CORE. PG-FDD, specifically designed for fairness generalization, leads in overall performance. However, it does not excel in terms of performance changes relative to the intra-domain test results in Table 5, indicating room for improvement in its generalization capabilities. 2) CORE is notable for showing negative fairness performance changes on A-DF-1.0 and GenData, suggesting that techniques within CORE could be explored to enhance fairness generalization. More results are in Appendix B.7.
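The values in parentheses in Table 6 are the cross-domain result minus the intra-domain result from Table 5. A minimal sketch of that arithmetic for Xception on A-DF-1.0 follows; the helper name is hypothetical, and the numbers are copied from Tables 5 and 6.

```python
def performance_change(cross_domain, intra_domain):
    """Cross-domain value minus intra-domain value, rounded to 3 decimals."""
    return round(cross_domain - intra_domain, 3)


# Xception: intra-domain values from Table 5 (gender F_OAE, gender F_EO, AUC).
intra = {"F_OAE": 0.271, "F_EO": 0.439, "AUC": 99.768}
# Xception on A-DF-1.0 from Table 6.
cross = {"F_OAE": 4.227, "F_EO": 9.198, "AUC": 82.479}

changes = {metric: performance_change(cross[metric], intra[metric])
           for metric in intra}
```

A positive change in a fairness metric means larger disparity on the unseen dataset, while a negative change in AUC means lower utility.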

Figure 7: Impact of training set size on detectors’ intersectional $F_{EO}$ (%) and AUC (%).

Effect of Increasing Training Set Size. We randomly sample 20%, 40%, 60%, and 80% of each training subset from AI-Face to assess the impact of training size on performance. Key observations from Fig. 7: 1) The performance of UnivFD changes the least and does not improve as the training set grows. 2) Overall, detectors’ performance improves with larger training size, though a few show fluctuations (e.g., ViT-B/16 and CORE). 3) A larger training set may improve utility but not always fairness. For example, Xception and SRM show increased utility when training size grows from 60% to 80%, but fairness worsens. Similar trends are observed in DAG-FDD and SPSL when the training set size increases from 40% to 60%. See Appendix B.8 for full results.
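Sampling the same fraction from each training subset can be sketched as follows. The function name, the seed handling, the subset names, and the sample counts are hypothetical; the sketch only illustrates per-subset random sampling, not the authors' exact procedure.

```python
import random


def sample_training_subsets(subsets, fraction, seed=0):
    """Randomly keep `fraction` of the samples in every subset.

    subsets: {subset name: list of sample ids}.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = {}
    for name in sorted(subsets):
        items = subsets[name]
        keep = round(len(items) * fraction)
        sampled[name] = rng.sample(items, keep)
    return sampled


# Hypothetical subsets with made-up sizes.
training = {
    "subset_gan": list(range(10)),
    "subset_dm": list(range(100, 150)),
}
by_fraction = {f: sample_training_subsets(training, f)
               for f in (0.2, 0.4, 0.6, 0.8)}
```

Sampling within each subset keeps the mix of generative methods unchanged as the training size varies.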

Discussion. Based on the above experiments, we summarize the unsolved fairness problems in recent detectors: 1) Detectors’ fairness is unstable across face images generated by different generative methods, indicating a future direction of enhancing fairness stability as new generative models continue to emerge. 2) Even though fairness-enhanced detectors exhibit small overall fairness metrics, they still show biased detection towards minority groups. Future studies should take more care when designing fair detectors to ensure balanced performance across all demographic groups. 3) There is currently no reliable detector, as all detectors experience severe performance degradation under image post-processing and cross-domain evaluation. Future studies should aim to develop a unified framework that ensures fairness, robustness, and generalization, as these three characteristics are essential for creating a reliable detector.

5 Conclusion

This work presents the first demographically annotated million-scale AI-Face dataset, serving as a pivotal foundation for addressing the urgent need for developing fair AI face detectors. Based on our AI-Face dataset, we conduct the first comprehensive fairness benchmark, shedding light on the fairness performance and challenges of current representative AI face detectors. Our findings can inspire and guide researchers in refining current models and exploring new methods to mitigate bias. Limitation and Future Work: One limitation is that age annotations in our AI-Face dataset have relatively lower accuracy as the age attribute is often too ambiguous to predict. We will improve our annotator’s accuracy in predicting age attributes in the future. Additionally, we plan to extend our fairness benchmark to evaluate large language models like LLaMA2 [78] and GPT4 [79] for detecting AI faces. Social Impact: Malicious users could misuse AI-generated face images from our dataset to create fake social media profiles and spread misinformation. To mitigate this risk, only users who submit a signed end-user license agreement (EULA) will be granted access to our dataset.

Acknowledgment

This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2348419 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of NSF and NAIRR Pilot.

References

  • [1] Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, and Shu Hu. Detecting multimedia generated by large ai models: A survey. arXiv preprint arXiv:2402.00045, 2024.
  • [2] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019.
  • [3] Deepfakes github. https://github.com/deepfakes/faceswap. Accessed: 2024-04-17.
  • [4] Fakeapp. https://www.fakeapp.com/. Accessed: 2024-04-17.
  • [5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR 2019, 2019.
  • [6] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • [7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • [8] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in neural information processing systems, 34:852–863, 2021.
  • [9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [10] Daniel J Tojin T. Eapen. How generative ai can augment human creativity. https://hbr.org/2023/07/how-generative-ai-can-augment-human-creativity, 2023. Accessed: 2024-04-21.
  • [11] BBC News. Trump supporters target black voters with faked ai images. https://www.bbc.com/news/world-us-canada-68440150, 2024. Accessed: 2023-05-09.
  • [12] Henrik Skaug Sætra. Generative ai: Here to stay, but for good? Technology in Society, 75:102372, 2023.
  • [13] Mika Westerlund. The emergence of deepfake technology: A review. Technology innovation management review, 9(11), 2019.
  • [14] Wenbo Pu, Jing Hu, Xin Wang, Yuezun Li, Shu Hu, Bin Zhu, Rui Song, Qi Song, Xi Wu, and Siwei Lyu. Learning a deep dual-level network for robust deepfake detection. Pattern Recognition, 130:108832, 2022.
  • [15] Hui Guo, Shu Hu, Xin Wang, Ming-Ching Chang, and Siwei Lyu. Robust attentive deep neural network for detecting gan-generated faces. IEEE Access, 10:32574–32583, 2022.
  • [16] Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Ucf: Uncovering common features for generalizable deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22412–22423, 2023.
  • [17] Lorenzo Papa, Lorenzo Faiella, Luca Corvitto, Luca Maiano, and Irene Amerini. On the use of stable diffusion for creating realistic faces: from generation to detection. In 2023 11th International Workshop on Biometrics and Forensics (IWBF), pages 1–6. IEEE, 2023.
  • [18] Loc Trinh and Yan Liu. An examination of fairness of ai models for deepfake detection. IJCAI, 2021.
  • [19] Ying Xu, Philipp Terhörst, Marius Pedersen, and Kiran Raja. Analyzing fairness in deepfake detection with massively annotated databases. IEEE Transactions on Technology and Society, 2024.
  • [20] Yan Ju, Shu Hu, Shan Jia, George H Chen, and Siwei Lyu. Improving fairness in deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4655–4665, 2024.
  • [21] Li Lin, Xinan He, Yan Ju, Xin Wang, Feng Ding, and Shu Hu. Preserving fairness generalization in deepfake detection. CVPR, 2024.
  • [22] Christopher Teo, Milad Abdollahzadeh, and Ngai-Man Cheung. On measuring fairness in generative models. Advances in Neural Information Processing Systems, 36, 2024.
  • [23] Chuqiao Li, Zhiwu Huang, Danda Pani Paudel, Yabin Wang, Mohamad Shahbazi, Xiaopeng Hong, and Luc Van Gool. A continual deepfake detection benchmark: Dataset, methods, and essentials. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1339–1349, 2023.
  • [24] Jingyi Deng, Chenhao Lin, Pengbin Hu, Chao Shen, Qian Wang, Qi Li, and Qiming Li. Towards benchmarking and evaluating deepfake detection. IEEE Transactions on Dependable and Secure Computing, 2024.
  • [25] Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. Deepfakebench: A comprehensive benchmark of deepfake detection. In NeurIPS, 2023.
  • [26] Binh M Le, Jiwon Kim, Shahroz Tariq, Kristen Moore, Alsharif Abuadbba, and Simon S Woo. Sok: Facial deepfake detectors. arXiv, 2024.
  • [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [28] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • [29] Midjourney. https://mid-journey.ai/. Accessed: 2024-04-17.
  • [30] Aditya Ramesh et al. Hierarchical text-conditional image generation with clip latents. arXiv, 1(2):3, 2022.
  • [31] Donie O’Sullivan. A high school student created a fake 2020 us candidate. twitter verified it. https://cnn.it/3HpHfzz, 2020. Accessed: 2024-04-21.
  • [32] Shannon Bond. That smiling linkedin profile face might be a computer-generated fake. https://www.npr.org/2022/03/27/1088140809/fake-linkedin-profiles, 2022. Accessed: 2024-04-21.
  • [33] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2889–2898, 2020.
  • [34] Kartik Narayan, Harsh Agarwal, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, and Richa Singh. Deephy: On deepfake phylogeny. In 2022 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2022.
  • [35] Kartik Narayan, Harsh Agarwal, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, and Richa Singh. Df-platter: multi-face heterogeneous deepfake dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9739–9748, 2023.
  • [36] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [37] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv, 2014.
  • [38] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE international conference on computer vision workshops, pages 10–15, 2015.
  • [39] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397, 2020.
  • [40] Google Research. Contributing data to deepfake detection research, 2019. Accessed: 2024-04-12.
  • [41] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207–3216, 2020.
  • [42] Megvii Technology Limited. Face++ Face Detection. https://www.faceplusplus.com/face-detection/. Accessed: 2024-03.
  • [43] InsightFace Project Contributors. InsightFace: State-of-the-Art Face Analysis Toolbox. https://insightface.ai/. Accessed: 2024-03.
  • [44] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open clip. https://github.com/mlfoundations/open_clip, 2021.
  • [45] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2020.
  • [46] Philipp Terhörst, Marco Huber, Jan Niklas Kolf, Ines Zelch, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Reliable age and gender estimation from face images: Stating the confidence of model predictions. In 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8. IEEE, 2019.
  • [47] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018.
  • [48] Philipp Terhörst, Daniel Fährmann, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Maad-face: A massively annotated attribute dataset for face images. IEEE Transactions on Information Forensics and Security, 16:3942–3957, 2021.
  • [49] Oliver Giudice, Luca Guarnera, and Sebastiano Battiato. Fighting deepfakes by detecting gan dct anomalies. Journal of Imaging, 7(8):128, 2021.
  • [50] Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Reverse engineering of generative models: Inferring model hyperparameters from generated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [51] David Beniaguev. Synthetic faces high quality (sfhq) dataset, 2022.
  • [52] Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking human and model perception of ai-generated images. Advances in Neural Information Processing Systems, 36, 2024.
  • [53] L Minh Dang, Syed Ibrahim Hassan, Suhyeon Im, Jaecheol Lee, Sujin Lee, and Hyeonjoon Moon. Deep learning based computer generated face identification using convolutional neural network. Applied Sciences, 8(12):2610, 2018.
  • [54] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
  • [55] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023.
  • [56] Minchul Kim, Feng Liu, Anil Jain, and Xiaoming Liu. Dcface: Synthetic face generation with dual condition diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12715–12725, 2023.
  • [57] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • [58] Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, and Shaikh Anowarul Fattah. Artifact: A large-scale dataset with artificial and factual images for generalizable and robust synthetic image detection. arXiv e-prints, pages arXiv–2302, 2023.
  • [59] Haixu Song, Shiyu Huang, Yinpeng Dong, and Wei-Wei Tu. Robustness and generalizability of deepfake detection: A study with diffusion models. arXiv preprint arXiv:2309.02218, 2023.
  • [60] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020.
  • [61] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • [62] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  • [63] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, 2021.
  • [64] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision, pages 86–103. Springer, 2020.
  • [65] Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 772–781, 2021.
  • [66] Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Generalizing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021.
  • [67] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023.
  • [68] Yunsheng Ni, Depu Meng, Changqian Yu, Chengbin Quan, Dongchun Ren, and Youjian Zhao. Core: Consistent representation learning for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12–21, 2022.
  • [69] Xiaotian Han, Jianfeng Chi, Yu Chen, Qifan Wang, Han Zhao, Na Zou, and Xia Hu. Ffb: A fair fairness benchmark for in-processing group fairness methods. In ICLR, 2024.
  • [70] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6):1–35, 2021.
  • [71] Jialu Wang, Xin Eric Wang, and Yang Liu. Understanding instance-level impact of fairness constraints. In International Conference on Machine Learning, pages 23114–23130. PMLR, 2022.
  • [72] Hao Wang, Luxi He, Rui Gao, and Flavio P Calmon. Aleatoric and epistemic discrimination in classification. ICML, 2023.
  • [73] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
  • [74] Shu Hu and George H Chen. Fairness in survival analysis with distributionally robust optimization. arXiv, 2023.
  • [75] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. Stgan: A unified selective transfer network for arbitrary image attribute editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3673–3682, 2019.
  • [76] Federico Cocchi, Lorenzo Baraldi, Samuele Poppi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Unveiling the impact of image transformations on deepfake detection: An experimental analysis. In International Conference on Image Analysis and Processing, pages 345–356. Springer, 2023.
  • [77] Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. arXiv preprint arXiv:2312.00195, 2023.
  • [78] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [79] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [80] Ying Xu et al. A comprehensive analysis of ai biases in deepfake detection with massively annotated databases. arXiv, 2022.
  • [81] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on information forensics and security, 9(12):2170–2179, 2014.
  • [82] Robert Williamson and Aditya Menon. Fairness risk measures. In International conference on machine learning, pages 6786–6797. PMLR, 2019.
  • [83] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33:8847–8860, 2020.
  • [84] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
  • [85] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR, 2018.
  • [86] John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
  • [87] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.

Appendix

Appendix A The Details of Demographically Annotated AI-Face Dataset

A.1 Phase 1: Annotator Development

A.1.1 Annotator Implementation Details

All annotator experiments are implemented in PyTorch and run on a single NVIDIA RTX A6000 GPU. For training, we fix the batch size to 64 and the number of epochs to 32, and use the Adam optimizer with an initial learning rate $\beta = 1e-3$. Additionally, we employ a Cosine Annealing Learning Rate Scheduler to adjust the learning rate over the course of training. The hyperparameter $\gamma$ in SAM optimization is set to 0.05. For uncertainty estimation, $k$ and $\rho$ in the uncertainty score $V(X^{(a)})$ are set to 100 and 0.2, respectively.
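The learning-rate schedule described above can be written in closed form. The following is a minimal sketch rather than the actual training code: it assumes the scheduler is stepped once per epoch, that the annealing period equals the 32 training epochs, and that the minimum learning rate is 0. The names `AnnotatorTrainConfig` and `cosine_annealed_lr` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatorTrainConfig:
    """Hyperparameters stated in A.1.1 (min_lr is an assumption)."""
    batch_size: int = 64
    epochs: int = 32
    initial_lr: float = 1e-3   # beta in the text
    sam_gamma: float = 0.05    # gamma in SAM optimization
    min_lr: float = 0.0        # assumption: not given in the text


def cosine_annealed_lr(cfg: AnnotatorTrainConfig, epoch: int) -> float:
    """Standard cosine annealing without restarts, evaluated at `epoch`."""
    cos_term = 1.0 + math.cos(math.pi * epoch / cfg.epochs)
    return cfg.min_lr + 0.5 * (cfg.initial_lr - cfg.min_lr) * cos_term


cfg = AnnotatorTrainConfig()
# Learning rate used in each of the 32 epochs (epoch index 0..31).
schedule = [cosine_annealed_lr(cfg, e) for e in range(cfg.epochs)]
```

Under these assumptions the learning rate starts at the initial value, reaches half of it at the midpoint of training, and decays towards the minimum at the end.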

A.1.2 Details of Threshold Settings for Sample Difficulty Level

For the setting of Q2: according to the distributions shown in Appendix A.2.2, for the VGGFace2 [47], A-DFDC [80], and A-DFD [80] test sets, the thresholds $t_1^{Gender}$ and $t_2^{Gender}$ are set to 0.25 and 0.4, respectively, and $t_1^{Age}$ and $t_2^{Age}$ are set to 0.3 and 0.5, respectively. The thresholds for the gender attribute are stricter than those for age because gender prediction is a relatively easier task than age prediction, which is also reflected in the distributions. For A-FF++ [80] and A-Celeb-DF-v2 [80], we adjust $t_1^{Gender}$ to 0.21 and $t_2^{Gender}$ to 0.25 in order to obtain a sufficient number of images (1,500) in each sample difficulty level subset, especially for the ‘Hard’ level.
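As an illustration of how two thresholds split samples into three difficulty levels, consider the sketch below. It assumes that a lower uncertainty score means an easier sample, that scores below $t_1$ are ‘Easy’, scores in $[t_1, t_2)$ are ‘Medium’, and the rest are ‘Hard’; the boundary handling and the helper names `thresholds_for` and `difficulty_level` are assumptions, and only the threshold values come from the text.

```python
from typing import Tuple

# Thresholds (t1, t2) stated in the text.
DEFAULT_THRESHOLDS = {"gender": (0.25, 0.40), "age": (0.30, 0.50)}
# Gender thresholds adjusted for two test sets; age is assumed unchanged.
GENDER_OVERRIDES = {"A-FF++": (0.21, 0.25), "A-Celeb-DF-v2": (0.21, 0.25)}


def thresholds_for(dataset: str, attribute: str) -> Tuple[float, float]:
    """Return (t1, t2) for an attribute on a given test set."""
    if attribute == "gender" and dataset in GENDER_OVERRIDES:
        return GENDER_OVERRIDES[dataset]
    return DEFAULT_THRESHOLDS[attribute]


def difficulty_level(score: float, dataset: str, attribute: str) -> str:
    """Map an uncertainty score to 'Easy', 'Medium', or 'Hard'."""
    t1, t2 = thresholds_for(dataset, attribute)
    if score < t1:
        return "Easy"
    if score < t2:
        return "Medium"
    return "Hard"


# The same score can fall into different levels on different test sets.
level_vgg = difficulty_level(0.22, "VGGFace2", "gender")
level_ff = difficulty_level(0.22, "A-FF++", "gender")
```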

A.1.3 Additional Annotator Evaluation Results

Tables 7 to 11 report comparison results of our annotator against the baselines InsightFace [43] and Face++ [42] on detailed attributes. The findings align with the results in Table 3 of the submitted manuscript. For cross-domain evaluation, we additionally choose the Adience [81] dataset, whose images are manually annotated and which consists of over 26.5k real images of over 2.2k different individuals in unconstrained environments, to further validate the effectiveness and generalization capability of our annotator. The results in Table 12 demonstrate that our annotator again outperforms InsightFace [43] and Face++ [42]. Overall, one intra-domain dataset (VGGFace2) and five cross-domain datasets (A-FF++, A-DFDC, A-DFD, A-Celeb-DF-v2, and Adience) all validate our annotator’s superior performance against the current state-of-the-art face attribute prediction tools Face++ [42] and InsightFace [43].

Dataset: VGGFace2 [47]
Level | Method | Statistic | Female (precision recall F1) | Male (precision recall F1) | Young (precision recall F1) | Middle_Aged (precision recall F1) | Senior (precision recall F1)
All | Face++ [42] | mean | 77.042 83.597 80.060 | 79.571 72.314 75.584 | 80.201 34.649 46.991 | 46.049 63.692 53.327 | 66.266 77.261 71.320
All | Face++ [42] | std | (0.363) (0.789) (0.444) | (0.780) (0.516) (0.453) | (2.448) (1.182) (1.278) | (0.832) (1.588) (0.976) | (1.030) (1.188) (0.955)
All | InsightFace [43] | mean | 76.560 78.062 77.281 | 76.946 75.395 76.139 | 78.730 30.533 42.907 | 41.660 56.653 47.936 | 61.426 76.107 67.966
All | InsightFace [43] | std | (0.533) (0.737) (0.534) | (0.629) (0.555) (0.474) | (2.135) (0.961) (1.052) | (1.005) (1.429) (1.078) | (0.649) (1.212) (0.724)
All | Ours | mean | 81.452 89.467 85.158 | 87.267 78.329 82.401 | 95.619 80.467 86.659 | 91.189 75.027 81.665 | 88.043 76.720 81.747
All | Ours | std | (0.413) (0.331) (0.323) | (0.393) (0.572) (0.438) | (0.565) (1.343) (0.629) | (0.812) (0.880) (0.585) | (0.982) (1.388) (0.837)
Easy | Face++ [42] | mean | 97.482 96.742 97.108 | 96.697 97.439 97.064 | 89.360 57.507 69.964 | 58.673 72.205 64.729 | 79.536 89.546 84.240
Easy | Face++ [42] | std | (0.604) (0.549) (0.329) | (0.545) (0.642) (0.355) | (0.670) (1.825) (1.403) | (0.837) (1.397) (0.699) | (1.119) (0.600) (0.687)
Easy | InsightFace [43] | mean | 97.625 96.373 96.994 | 96.420 97.653 97.032 | 88.544 50.080 63.957 | 53.146 65.720 58.762 | 73.657 88.200 80.271
Easy | InsightFace [43] | std | (0.403) (0.229) (0.124) | (0.206) (0.410) (0.135) | (0.917) (1.831) (1.585) | (0.621) (1.001) (0.497) | (0.808) (0.748) (0.555)
Easy | Ours | mean | 99.575 99.893 99.734 | 99.893 99.573 99.733 | 99.720 99.760 99.740 | 99.279 99.000 99.139 | 99.551 96.480 97.988
Easy | Ours | std | (0.176) (0.100) (0.126) | (0.100) (0.177) (0.127) | (0.098) (0.080) (0.049) | (0.267) (0.400) (0.164) | (0.431) (0.688) (0.204)
Medium | Face++ [42] | mean | 72.977 81.336 76.927 | 78.435 69.245 73.549 | 76.257 24.427 36.996 | 41.197 61.838 49.446 | 62.616 73.719 67.710
Medium | Face++ [42] | std | (0.292) (1.246) (0.679) | (1.101) (0.681) (0.618) | (2.732) (0.709) (1.057) | (0.961) (2.458) (1.472) | (1.171) (0.503) (0.759)
Medium | InsightFace [43] | mean | 73.710 75.360 74.521 | 74.807 73.120 73.950 | 74.222 22.080 34.033 | 37.372 53.480 43.995 | 58.070 73.840 65.008
Medium | InsightFace [43] | std | (0.699) (1.316) (0.907) | (1.070) (0.766) (0.753) | (2.488) (0.588) (0.928) | (1.208) (2.160) (1.535) | (0.282) (1.216) (0.497)
Medium | Ours | mean | 82.518 95.253 88.428 | 94.389 79.813 86.489 | 96.648 84.200 89.987 | 95.382 75.920 84.540 | 87.960 76.800 81.993
Medium | Ours | std | (0.621) (0.482) (0.439) | (0.544) (0.858) (0.583) | (0.627) (1.730) (0.111) | (0.604) (1.017) (0.650) | (0.915) (1.544) (1.018)
Hard | Face++ [42] | mean | 60.667 72.714 66.146 | 63.582 50.259 56.140 | 74.987 22.012 34.013 | 38.276 57.032 45.807 | 56.647 68.517 62.009
Hard | Face++ [42] | std | (0.194) (0.571) (0.323) | (0.694) (0.226) (0.385) | (3.942) (1.012) (1.374) | (0.697) (0.910) (0.758) | (0.799) (2.462) (1.420)
Hard | InsightFace [43] | mean | 58.345 62.453 60.329 | 59.611 55.413 57.435 | 73.425 19.440 30.730 | 34.462 50.760 41.050 | 52.550 66.280 58.618
Hard | InsightFace [43] | std | (0.498) (0.667) (0.570) | (0.610) (0.489) (0.533) | (2.999) (0.463) (0.642) | (1.187) (1.127) (1.202) | (0.858) (1.671) (1.121)
Hard | Ours | mean | 62.263 73.255 67.312 | 67.518 55.600 60.982 | 90.490 57.440 70.249 | 78.905 50.160 61.315 | 76.617 56.880 65.260
Hard | Ours | std | (0.443) (0.410) (0.403) | (0.534) (0.680) (0.604) | (0.971) (2.218) (1.726) | (1.564) (1.222) (0.941) | (1.600) (1.933) (1.288)
Table 7: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on VGGFace2 dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.
Dataset: A-FF++ [80]
Level | Method | Statistic | Female (precision recall F1) | Male (precision recall F1) | Young (precision recall F1) | Middle_Aged (precision recall F1) | Senior (precision recall F1)
All | Face++ [42] | mean | 90.106 88.345 89.129 | 88.560 90.007 89.201 | 74.707 72.839 73.547 | 52.934 81.935 63.981 | 84.769 37.933 51.193
All | Face++ [42] | std | (0.282) (0.400) (0.220) | (0.323) (0.295) (0.188) | (1.264) (0.581) (0.656) | (0.790) (1.288) (0.852) | (0.922) (1.705) (1.884)
All | InsightFace [43] | mean | 87.918 81.284 84.344 | 82.787 88.662 85.533 | 67.148 83.284 74.317 | 48.659 74.220 58.725 | 93.455 20.959 33.854
All | InsightFace [43] | std | (0.458) (0.861) (0.560) | (0.625) (0.472) (0.399) | (1.429) (0.427) (0.959) | (0.658) (1.532) (0.831) | (1.228) (1.708) (2.315)
All | Ours | mean | 89.851 94.149 91.866 | 93.177 88.451 90.641 | 90.888 87.477 88.689 | 88.461 71.439 78.337 | 93.390 66.519 77.023
All | Ours | std | (0.290) (0.302) (0.202) | (0.280) (0.384) (0.226) | (0.632) (0.280) (0.396) | (1.556) (1.064) (1.012) | (0.686) (1.461) (1.138)
Easy | Face++ [42] | mean | 98.157 97.947 98.052 | 97.950 98.159 98.054 | 85.686 93.398 89.372 | 69.170 89.520 78.036 | 95.646 58.880 72.888
Easy | Face++ [42] | std | (0.302) (0.136) (0.146) | (0.131) (0.308) (0.151) | (0.625) (1.072) (0.676) | (0.878) (0.431) (0.651) | (0.371) (0.546) (0.427)
Easy | InsightFace [43] | mean | 97.004 96.640 96.820 | 96.656 97.013 96.833 | 76.172 98.120 85.762 | 61.246 86.840 71.828 | 98.106 28.800 44.524
Easy | InsightFace [43] | std | (0.470) (0.605) (0.387) | (0.579) (0.482) (0.380) | (1.303) (0.665) (1.016) | (0.576) (1.039) (0.519) | (0.755) (0.633) (0.718)
Easy | Ours | mean | 97.987 99.920 98.944 | 99.919 97.947 98.923 | 91.450 100.000 95.530 | 99.494 94.320 96.838 | 95.154 85.022 89.802
Easy | Ours | std | (0.222) (0.065) (0.092) | (0.067) (0.233) (0.097) | (0.862) (0.000) (0.470) | (0.170) (0.483) (0.286) | (0.500) (0.724) (0.478)
Medium | Face++ [42] | mean | 98.839 89.590 93.985 | 90.605 98.960 94.597 | 84.212 80.092 82.098 | 48.990 86.710 62.604 | 91.594 25.400 39.746
Medium | Face++ [42] | std | (0.312) (0.691) (0.294) | (0.553) (0.285) (0.230) | (0.742) (0.379) (0.333) | (0.591) (0.585) (0.520) | (0.670) (1.688) (2.102)
Medium | InsightFace [43] | mean | 95.655 79.813 87.017 | 82.685 96.373 89.005 | 69.090 85.800 76.542 | 45.982 73.680 56.622 | 96.654 15.040 26.022
Medium | InsightFace [43] | std | (0.499) (0.798) (0.546) | (0.573) (0.433) (0.408) | (0.862) (0.358) (0.579) | (0.273) (0.985) (0.435) | (0.636) (0.794) (1.194)
Medium | Ours | mean | 98.907 98.827 98.866 | 98.828 98.907 98.867 | 96.968 97.748 97.356 | 95.636 79.488 86.812 | 96.102 63.732 76.632
Medium | Ours | std | (0.227) (0.131) (0.103) | (0.128) (0.229) (0.104) | (0.311) (0.230) (0.158) | (0.900) (0.673) (0.415) | (0.399) (1.483) (1.134)
Hard | Face++ [42] | mean | 73.323 77.498 75.352 | 77.123 72.901 74.952 | 54.224 45.026 49.172 | 40.642 69.576 51.302 | 67.066 29.518 40.946
Hard | Face++ [42] | std | (0.231) (0.374) (0.220) | (0.284) (0.292) (0.184) | (2.425) (0.291) (0.960) | (0.901) (2.847) (1.385) | (1.727) (2.881) (3.122)
Hard | InsightFace [43] | mean | 71.096 67.400 69.194 | 69.020 72.600 70.762 | 56.182 65.932 60.648 | 38.748 62.140 47.724 | 85.604 19.036 31.016
Hard | Ours | mean | 72.658 83.700 77.787 | 80.784 68.500 74.133 | 84.246 64.684 73.180 | 70.252 40.508 51.360 | 88.914 50.804 64.634
Hard | Ours | std | (0.419) (0.710) (0.411) | (0.645) (0.691) (0.477) | (0.722) (0.610) (0.561) | (3.600) (2.035) (2.334) | (1.160) (2.175) (1.803)
Table 8: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-FF++ dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.
Dataset: A-DFDC [19]
Level | Method | Statistic | Female (precision recall F1) | Male (precision recall F1) | Young (precision recall F1) | Middle_Aged (precision recall F1) | Senior (precision recall F1)
All | Face++ [42] | mean | 74.751 83.324 78.453 | 78.824 68.797 72.922 | 74.499 81.417 77.479 | 58.696 74.613 65.608 | 85.533 52.741 63.542
All | Face++ [42] | std | (0.675) (0.948) (0.603) | (0.933) (0.929) (0.732) | (0.567) (1.186) (0.772) | (0.688) (0.420) (0.496) | (0.792) (1.559) (1.469)
All | InsightFace [43] | mean | 72.691 64.662 68.376 | 68.160 75.560 71.623 | 71.209 86.658 77.985 | 58.102 76.777 66.086 | 83.061 36.659 49.620
All | InsightFace [43] | std | (0.649) (1.208) (0.671) | (0.779) (0.754) (0.453) | (0.852) (0.703) (0.676) | (1.040) (0.370) (0.797) | (0.989) (1.802) (1.833)
All | Ours | mean | 74.805 88.720 80.839 | 85.137 67.864 74.834 | 88.669 76.413 81.695 | 91.753 77.845 83.772 | 90.797 77.147 82.823
All | Ours | std | (0.588) (0.587) (0.498) | (0.622) (0.848) (0.619) | (0.633) (1.103) (0.752) | (0.736) (0.670) (0.545) | (0.618) (1.356) (1.116)
Easy | Face++ [42] | mean | 94.124 89.917 91.965 | 90.351 94.367 92.309 | 86.586 96.596 91.316 | 73.340 85.560 78.974 | 99.440 71.324 83.066
Easy | Face++ [42] | std | (0.977) (1.010) (0.564) | (0.835) (1.025) (0.535) | (0.691) (0.713) (0.489) | (0.853) (0.933) (0.622) | (0.255) (1.013) (0.768)
Easy | InsightFace [43] | mean | 91.874 78.693 84.753 | 81.389 93.013 86.801 | 73.580 95.760 83.216 | 62.690 77.960 69.496 | 94.034 42.760 58.776
Easy | InsightFace [43] | std | (0.986) (1.798) (0.744) | (1.140) (1.059) (0.384) | (0.821) (0.196) (0.552) | (0.659) (0.898) (0.746) | (1.262) (1.209) (1.189)
Easy | Ours | mean | 95.289 97.413 96.337 | 97.353 95.173 96.248 | 97.448 99.240 98.334 | 94.902 96.996 95.938 | 99.730 89.844 94.528
Easy | Ours | std | (1.019) (0.136) (0.566) | (0.157) (1.098) (0.621) | (0.635) (0.233) (0.359) | (0.030) (0.540) (0.280) | (0.168) (0.433) (0.303)
Medium | Face++ [42] | mean | 70.438 81.415 75.529 | 77.764 65.533 71.124 | 70.980 64.526 67.584 | 53.830 69.930 60.828 | 73.342 58.332 64.972
Medium | Face++ [42] | std | (0.334) (0.881) (0.528) | (0.836) (0.418) (0.457) | (0.703) (2.147) (1.379) | (0.578) (0.000) (0.372) | (0.815) (1.562) (1.158)
Medium | InsightFace [43] | mean | 69.126 66.733 67.905 | 67.859 70.200 69.006 | 73.794 77.866 75.768 | 54.900 71.330 62.038 | 68.248 44.000 53.474
Medium | InsightFace [43] | std | (0.316) (1.285) (0.813) | (0.788) (0.221) (0.345) | (1.105) (1.205) (0.966) | (1.229) (0.000) (0.780) | (0.711) (2.241) (1.769)
Medium | Ours | mean | 69.072 95.067 80.011 | 92.097 57.433 70.746 | 89.568 67.934 77.250 | 94.220 69.246 79.824 | 82.306 82.014 82.158
Medium | Ours | std | (0.350) (0.680) (0.444) | (1.027) (0.512) (0.593) | (0.325) (2.048) (1.451) | (1.257) (0.431) (0.524) | (1.062) (0.908) (0.891)
Hard | Face++ [42] | mean | 59.692 78.639 67.866 | 68.357 46.489 55.334 | 65.930 83.128 73.536 | 48.918 68.350 57.022 | 83.816 28.568 42.588
Hard | Face++ [42] | std | (0.715) (0.953) (0.716) | (1.128) (1.344) (1.203) | (0.307) (0.698) (0.449) | (0.635) (0.326) (0.495) | (1.307) (2.102) (2.481)
Hard | InsightFace [43] | mean | 57.073 48.560 52.471 | 55.231 63.467 59.062 | 66.252 86.348 74.972 | 56.716 81.042 66.724 | 86.900 23.216 36.610
Hard | InsightFace [43] | std | (0.647) (0.543) (0.456) | (0.409) (0.983) (0.629) | (0.631) (0.707) (0.511) | (1.233) (0.211) (0.864) | (0.995) (1.955) (2.540)
Hard | Ours | mean | 60.054 73.680 66.170 | 65.961 50.987 57.509 | 78.990 62.066 69.500 | 86.136 67.294 75.554 | 90.354 59.582 71.784
Hard | Ours | std | (0.394) (0.945) (0.485) | (0.683) (0.934) (0.645) | (0.940) (1.029) (0.447) | (0.922) (1.038) (0.832) | (0.624) (2.726) (2.154)
Table 9: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-DFDC dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.
Dataset: A-DFD [19]
Level | Method | Statistic | Female (precision recall F1) | Male (precision recall F1)
All | Face++ [42] | mean | 74.375 62.743 67.925 | 68.258 78.854 73.096
All | Face++ [42] | std | (1.442) (1.445) (1.256) | (1.197) (1.544) (1.228)
All | InsightFace [43] | mean | 71.967 51.796 59.975 | 63.600 81.636 71.405
All | InsightFace [43] | std | (1.062) (1.161) (1.053) | (0.651) (0.863) (0.660)
All | Ours | mean | 72.884 83.547 77.548 | 78.938 66.418 71.615
All | Ours | std | (0.580) (0.753) (0.538) | (0.750) (0.979) (0.783)
Easy | Face++ [42] | mean | 95.014 82.284 88.179 | 84.398 95.677 89.676
Easy | Face++ [42] | std | (0.377) (1.876) (1.011) | (1.378) (0.391) (0.686)
Easy | InsightFace [43] | mean | 94.405 75.573 83.944 | 79.639 95.520 86.858
Easy | InsightFace [43] | std | (0.777) (0.956) (0.794) | (0.683) (0.634) (0.596)
Easy | Ours | mean | 94.434 93.600 94.013 | 93.659 94.480 94.066
Easy | Ours | std | (0.418) (0.566) (0.307) | (0.517) (0.451) (0.295)
Medium | Face++ [42] | mean | 65.536 60.151 62.715 | 63.114 68.283 65.587
Medium | Face++ [42] | std | (1.779) (1.249) (1.202) | (1.077) (2.356) (1.566)
Medium | InsightFace [43] | mean | 64.397 47.147 54.434 | 58.323 73.947 65.211
Medium | InsightFace [43] | std | (1.071) (1.433) (1.310) | (0.769) (0.646) (0.671)
Medium | Ours | mean | 65.158 86.133 74.188 | 79.536 53.920 64.258
Medium | Ours | std | (0.886) (0.625) (0.666) | (0.911) (1.776) (1.472)
Hard | Face++ [42] | mean | 62.576 45.793 52.882 | 57.261 72.603 64.024
Hard | Face++ [42] | std | (2.170) (1.210) (1.556) | (1.137) (1.885) (1.433)
Hard | InsightFace [43] | mean | 57.100 32.667 41.547 | 52.838 75.440 62.145
Hard | InsightFace [43] | std | (1.339) (1.096) (1.054) | (0.501) (1.310) (0.714)
Hard | Ours | mean | 59.062 70.907 64.442 | 63.618 50.853 56.520
Hard | Ours | std | (0.437) (1.068) (0.641) | (0.820) (0.709) (0.582)
Table 10: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-DFD dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.
Dataset: A-Celeb-DF-v2 [19]
Level | Method | Statistic | Female (precision recall F1) | Male (precision recall F1)
All | Face++ [42] | mean | 97.624 83.302 89.455 | 86.517 98.318 91.819
All | Face++ [42] | std | (0.815) (0.727) (0.438) | (0.494) (0.640) (0.453)
All | InsightFace [43] | mean | 97.442 85.842 91.041 | 88.079 98.007 92.635
All | InsightFace [43] | std | (0.537) (0.569) (0.317) | (0.388) (0.464) (0.299)
All | Ours | mean | 96.381 93.611 94.921 | 94.147 96.687 95.355
All | Ours | std | (0.756) (0.564) (0.400) | (0.422) (0.782) (0.426)
Easy | Face++ [42] | mean | 99.889 96.480 98.155 | 96.598 99.893 98.218
Easy | Face++ [42] | std | (0.055) (0.418) (0.236) | (0.392) (0.053) (0.222)
Easy | InsightFace [43] | mean | 99.837 98.107 98.964 | 98.140 99.840 98.983
Easy | InsightFace [43] | std | (0.054) (0.352) (0.180) | (0.340) (0.053) (0.174)
Easy | Ours | mean | 100.000 99.973 99.987 | 99.973 100.000 99.987
Easy | Ours | std | (0.000) (0.053) (0.027) | (0.053) (0.000) (0.027)
Medium | Face++ [42] | mean | 99.732 89.227 94.185 | 90.260 99.760 94.771
Medium | Face++ [42] | std | (0.199) (0.952) (0.558) | (0.787) (0.177) (0.460)
Medium | InsightFace [43] | mean | 99.639 88.320 93.638 | 89.513 99.680 94.323
Medium | InsightFace [43] | std | (0.226) (0.496) (0.354) | (0.412) (0.200) (0.298)
Medium | Ours | mean | 99.760 99.760 99.760 | 99.760 99.760 99.760
Medium | Ours | std | (0.053) (0.177) (0.100) | (0.176) (0.053) (0.100)
Hard | Face++ [42] | mean | 93.251 64.200 76.024 | 72.694 95.300 82.469
Hard | Face++ [42] | std | (2.190) (0.812) (0.519) | (0.304) (1.691) (0.679)
Hard | InsightFace [43] | mean | 92.850 71.100 80.521 | 76.584 94.500 84.600
Hard | InsightFace [43] | std | (1.331) (0.860) (0.417) | (0.412) (1.140) (0.424)
Hard | Ours | mean | 89.384 81.100 85.015 | 82.707 90.300 86.319
Hard | Ours | std | (2.214) (1.463) (1.073) | (1.037) (2.294) (1.150)
Table 11: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-Celeb-DF-v2 dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.
Dataset: Adience [81]
Level | Method | Statistic | Female (precision recall F1) | Male (precision recall F1) | Young (precision recall F1) | Middle_Aged (precision recall F1) | Senior (precision recall F1)
All | Face++ [42] | mean | 71.800 87.124 76.457 | 84.701 71.257 76.322 | 89.823 68.558 77.632 | 51.442 70.678 58.668 | 50.280 76.456 60.540
All | Face++ [42] | std | (0.900) (0.743) (0.797) | (0.732) (0.863) (0.601) | (0.444) (1.024) (0.696) | (1.569) (1.397) (1.313) | (3.051) (3.228) (2.199)
All | InsightFace [43] | mean | 67.003 74.180 68.487 | 73.848 67.877 69.512 | 86.510 37.559 51.425 | 37.028 65.519 45.545 | 26.775 72.412 38.946
All | InsightFace [43] | std | (0.687) (1.140) (0.650) | (1.068) (0.722) (0.583) | (1.147) (1.079) (1.225) | (0.888) (1.304) (0.884) | (1.672) (2.321) (2.045)
All | Ours | mean | 78.103 94.416 83.852 | 96.915 82.496 88.602 | 77.517 81.767 79.578 | 48.206 37.327 41.843 | 67.352 69.370 68.241
All | Ours | std | (0.732) (0.338) (0.603) | (0.382) (0.607) (0.443) | (0.500) (0.963) (0.579) | (1.794) (0.529) (1.061) | (3.613) (3.679) (3.258)
Easy | Face++ [42] | mean | 97.977 93.613 95.658 | 92.325 97.319 94.754 | 96.754 81.939 88.730 | 33.469 68.722 44.960 | 59.861 88.473 71.165
Easy | Face++ [42] | std | (0.251) (0.646) (0.273) | (0.660) (0.344) (0.284) | (0.253) (0.746) (0.434) | (2.274) (1.985) (2.095) | (4.706) (4.331) (2.485)
Easy | InsightFace [43] | mean | 96.613 89.815 93.087 | 88.180 96.009 91.925 | 97.076 58.063 72.662 | 19.981 70.808 31.150 | 26.572 84.972 40.411
Easy | InsightFace [43] | std | (0.210) (1.000) (0.537) | (0.980) (0.331) (0.540) | (0.455) (0.819) (0.717) | (0.942) (2.276) (1.185) | (2.943) (2.722) (3.521)
Easy | Ours | mean | 99.822 99.974 99.898 | 99.969 99.776 99.872 | 93.651 97.159 95.372 | 57.920 37.414 45.375 | 97.721 96.831 97.221
Easy | Ours | std | (0.104) (0.052) (0.078) | (0.063) (0.124) (0.094) | (0.599) (0.507) (0.455) | (2.428) (0.217) (1.469) | (3.063) (2.986) (2.037)
Medium | Face++ [42] | mean | 82.280 87.626 84.863 | 72.271 63.091 67.350 | 90.557 64.622 75.413 | 57.229 71.100 63.397 | 47.362 75.972 58.295
Medium | Face++ [42] | std | (1.064) (0.710) (0.617) | (1.249) (1.654) (0.996) | (0.379) (1.444) (1.015) | (1.735) (1.580) (1.350) | (2.070) (3.031) (1.800)
Medium | InsightFace [43] | mean | 78.900 75.366 77.087 | 55.589 60.493 57.923 | 83.417 30.372 44.510 | 40.618 61.066 48.776 | 26.303 65.071 37.453
Medium | InsightFace [43] | std | (1.250) (0.827) (0.818) | (0.852) (1.375) (0.637) | (2.029) (2.139) (2.587) | (0.934) (0.713) (0.618) | (1.300) (2.274) (1.616)
Medium | Ours | mean | 92.671 99.184 95.816 | 98.159 84.639 90.897 | 76.159 80.899 78.455 | 40.740 34.599 37.409 | 69.317 69.799 69.532
Medium | Ours | std | (0.607) (0.280) (0.385) | (0.599) (0.735) (0.505) | (0.372) (0.913) (0.492) | (1.307) (0.808) (0.812) | (5.848) (5.737) (5.645)
Hard | Face++ [42] | mean | 35.144 80.134 48.851 | 89.507 53.361 66.860 | 82.159 59.114 68.752 | 63.628 72.212 67.646 | 43.617 64.923 52.162
Hard | Face++ [42] | std | (1.384) (0.874) (1.502) | (0.287) (0.591) (0.522) | (0.701) (0.882) (0.638) | (0.697) (0.626) (0.495) | (2.376) (2.322) (2.311)
Hard | InsightFace [43] | mean | 25.495 57.358 35.287 | 77.776 47.129 58.688 | 79.037 24.242 37.102 | 50.485 64.681 56.708 | 27.452 67.194 38.973
Hard | InsightFace [43] | std | (0.602) (1.594) (0.596) | (1.373) (0.460) (0.572) | (0.956) (0.280) (0.372) | (0.789) (0.923) (0.848) | (0.773) (1.967) (0.999)
Hard | Ours | mean | 41.818 84.089 55.842 | 92.619 63.072 75.038 | 62.742 67.243 64.907 | 45.958 39.968 42.745 | 35.018 41.479 37.972
Hard | Ours | std | (1.484) (0.684) (1.345) | (0.486) (0.962) (0.730) | (0.530) (1.468) (0.790) | (1.648) (0.563) (0.902) | (1.929) (2.314) (2.092)
Table 12: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on Adience dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.
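
The per-class metrics reported in Tables 7 to 12 can be reproduced along the following lines. This is a minimal sketch under stated assumptions, not the evaluation code used for the paper: the function names are hypothetical, the standard deviation is taken as the population standard deviation over the random samplings, and the ‘All’ row is computed as the plain average of the three difficulty levels, as described in the captions.

```python
import statistics
from typing import List, Sequence, Tuple


def per_class_prf(y_true: Sequence[str], y_pred: Sequence[str],
                  label: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 (in percent) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return 100 * precision, 100 * recall, 100 * f1


def summarize_runs(runs: List[Tuple[Sequence[str], Sequence[str]]],
                   label: str) -> List[Tuple[float, float]]:
    """(mean, std) of precision, recall, F1 over repeated samplings."""
    scores = [per_class_prf(t, p, label) for t, p in runs]
    return [(statistics.fmean(col), statistics.pstdev(col))
            for col in zip(*scores)]


def average_levels(easy: float, medium: float, hard: float) -> float:
    """'All' entry as the average over the three difficulty levels."""
    return (easy + medium + hard) / 3


# Face++ Female F1 on VGGFace2 (Table 7): Easy, Medium, Hard values.
all_f1 = average_levels(97.108, 76.927, 66.146)
```

With the Table 7 values above, `all_f1` agrees with the reported ‘All’ entry of 80.060 up to rounding.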

A.2 Phase 2: Demographic Annotation Generation

A.2.1 Detailed Information of Datasets

Fake subsets:
Methods | #Samples
A-FF++ [2] | 105K
A-DFDC [39] | 37K
A-DFD [40] | 31K
A-Celeb-DF-v2 [41] | 155K
AttGAN [49] | 6K
MMDGAN [50] | 1K
StarGAN [49] | 5.6K
StyleGAN [49] | 10K
StyleGAN2 [51] | 118K
StyleGAN3 [52] | 26.7K
MSG-StyleGAN [50] | 1K
ProGAN [53] | 100K
STGAN [50] | 1K
VQGAN [54] | 50K
DALLE2 [55] | 204
IF [55] | 505
Midjourney [55] | 100
DCFace [56] | 529K
Latent Diffusion [57] | 20K
Palette [58] | 6K
SD v1.5 [59] | 18K
SD Inpainting [59] | 20.9K
Total (fake) | 1,245,660

Real sources:
Source | #Samples
FFHQ [6] | 70,000
CASIA-WebFace [37] | 474,876
IMDB-WIKI [38] | 26,788
CelebA [36] | 202,502
A-FF++ (Real) [80] | 21,593
A-DFDC (Real) [80] | 37,836
A-DFD (Real) [80] | 8,856
A-Celeb-DF-v2 (Real) [80] | 23,645
Total (real) | 866,096
Table 13: Number of real and fake images from different fake image datasets and their corresponding real image sources.

Table 13 shows the detailed information of all subsets we collected and incorporated into our AI-Face dataset. It covers fake facial images from deepfake videos as well as faces generated by GANs and DMs. The corresponding real sources of most AI-generated face subsets are FFHQ [6] and CelebA [36]. In total, our AI-Face dataset contains 30 subsets (22 fake subsets and 8 real subsets) and 37 generation methods (counted as 5 in A-FF++, 5 in A-DFD, 8 in A-DFDC, 1 in A-Celeb-DF-v2, 10 GANs, and 8 DMs), comprising 1,245,660 fake face images and 866,096 real face images. Fig. A.1 visualizes face images from each subset. Fig. A.2 further shows the detailed demographic distribution of our AI-Face dataset. The dataset is relatively gender-balanced, and the subjects are predominantly young and white individuals.
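
As a quick consistency check, the totals quoted above can be recomputed from the per-group method counts and the real-image counts in the ‘Total’ row of Table 13. The snippet is only an arithmetic sketch; the variable names are hypothetical and the numbers are copied from the text and table.

```python
# Generation methods per group, as enumerated in the text.
METHODS_PER_GROUP = {
    "A-FF++": 5, "A-DFD": 5, "A-DFDC": 8, "A-Celeb-DF-v2": 1,
    "GANs": 10, "DMs": 8,
}

# Real-image counts per source, from the 'Total' row of Table 13.
REAL_IMAGES = {
    "FFHQ": 70_000, "CASIA-WebFace": 474_876, "IMDB-WIKI": 26_788,
    "CelebA": 202_502, "A-FF++ (Real)": 21_593, "A-DFDC (Real)": 37_836,
    "A-DFD (Real)": 8_856, "A-Celeb-DF-v2 (Real)": 23_645,
}

FAKE_TOTAL = 1_245_660  # stated total of fake images

total_methods = sum(METHODS_PER_GROUP.values())
total_real = sum(REAL_IMAGES.values())
# Share of fake images in the whole dataset.
fake_share = FAKE_TOTAL / (FAKE_TOTAL + total_real)
```

Both sums match the figures in the text (37 methods and 866,096 real images), and fake images make up a little under 60% of the dataset.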

Figure A.1: Visualization of images in AI-Face dataset. SD is short for Stable Diffusion.
Figure A.2: Detailed demographic distribution of AI-Face dataset.

A.2.2 Details of Threshold Settings for Human Correction

In this section, we present the uncertainty score distributions of each attribute (i.e., Gender, Age, and Race) for each subset in our AI-Face dataset, as shown in Fig. A.4 to Fig. A.31. Overall, these distributions show that our annotator is more confident in predicting gender than in predicting age. Because different subsets show different distributions, we dynamically adjust the threshold $t^{a_j}$ for each attribute $a$ on subset $j$, as defined in ‘Human Correction’ in Section 3.2. First, we fit a gamma distribution to the uncertainty scores and calculate its mean and standard deviation. Then, $t^{a_j}$ is calculated as $Mean + \lambda \cdot Std$. Given the threshold, we obtain the number of images within each subset that need human correction. Assuming that it takes three seconds for a human to correct one annotation for one image, we can calculate the total time needed for a human to correct the images beyond the threshold. Therefore, $\lambda$ is dynamically adjusted based on the distribution and the total time needed for human correction.
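
The thresholding procedure can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes a method-of-moments gamma fit (the text does not say which fitting procedure is used), treats images with a score strictly above the threshold as needing correction, and picks the smallest candidate $\lambda$ whose correction time fits a given budget. All function names and the synthetic scores are hypothetical.

```python
import random
import statistics

SECONDS_PER_CORRECTION = 3  # stated in the text


def fit_gamma_moments(scores):
    """Method-of-moments gamma fit: returns (shape, scale)."""
    mean = statistics.fmean(scores)
    var = statistics.pvariance(scores)
    return mean ** 2 / var, var / mean


def correction_threshold(scores, lam):
    """Threshold t = Mean + lambda * Std of the fitted gamma."""
    shape, scale = fit_gamma_moments(scores)
    fitted_mean = shape * scale
    fitted_std = (shape ** 0.5) * scale
    return fitted_mean + lam * fitted_std


def correction_cost(scores, lam):
    """Return (threshold, #images to correct, total seconds)."""
    t = correction_threshold(scores, lam)
    n_flagged = sum(s > t for s in scores)
    return t, n_flagged, n_flagged * SECONDS_PER_CORRECTION


def pick_lambda(scores, budget_seconds, candidates):
    """Smallest candidate lambda whose correction time fits the budget."""
    for lam in sorted(candidates):
        t, n_flagged, seconds = correction_cost(scores, lam)
        if seconds <= budget_seconds:
            return lam, t, n_flagged, seconds
    return None  # no candidate fits the budget


# Synthetic uncertainty scores for one attribute on one subset.
random.seed(0)
scores = [random.gammavariate(2.0, 0.05) for _ in range(5000)]
choice = pick_lambda(scores, budget_seconds=600,
                     candidates=[0.5, 1.0, 1.5, 2.0, 3.0])
```

A larger $\lambda$ raises the threshold, so fewer images are flagged and less annotation time is needed, which is the trade-off the dynamic adjustment balances.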

A.2.3 Examples of Mislabeled Images in A-FF++ and A-DFDC

In the evaluation results for Q1 in Section 3.2, we have validated that we cannot directly incorporate existing annotations into our AI-Face dataset. Fig. A.3 displays some image examples where the annotations in A-FF++ [80] and A-DFDC [80] are inconsistent with the annotations given by our annotator. A-FF++ and A-DFDC contain mislabeled annotations for ambiguous facial images, whereas our annotator predicts them accurately. This visualization further validates that existing annotations cannot be directly merged into our dataset.

A.2.4 Additional Results of Validating the Effectiveness of Human Correction Strategy

Since A-DFD [19] and A-Celeb-DF-v2 [19] provide gender annotations, we can compare two versions of our dataset against them. One is ours before human correction (i.e., Ours-DFD(w/o Correction) and Ours-A-Celeb-DF-v2(w/o Correction)), and the other is ours after human correction (i.e., Ours-DFD (Correction) and Ours-A-Celeb-DF-v2 (Correction)). Following the same setting as the evaluation for Q2 in Section 3.2, we sample 1,200 attribute-balanced images (400 easy, 400 medium, and 400 hard) based on the uncertainty score. Three humans re-annotated these images to establish the ground truth. As shown in Table 14, Ours-DFD (Correction) and Ours-A-Celeb-DF-v2 (Correction) outperform both our versions without correction and the original A-DFD and A-Celeb-DF-v2 annotations (e.g., the accuracy of Ours-DFD (Correction) is 22.866 percentage points higher than that of A-DFD and 13.526 percentage points higher than that of Ours-DFD(w/o Correction)). This suggests that the annotation quality of our dataset is much better than that of the existing annotations in A-DFD [19] and A-Celeb-DF-v2 [19], and that our human correction strategy further improves it.

Attribute: Gender
Method | ACC | Precision | Recall | F1
A-DFD [19] | 70.612 | 71.347 | 74.245 | 69.900
Ours-DFD(w/o Correction) | 79.952 | 79.868 | 83.979 | 79.308
Ours-DFD (Correction) | 93.478 | 91.673 | 95.034 | 92.898
A-Celeb-DF-v2 [19] | 89.697 | 90.622 | 90.622 | 89.697
Ours-A-Celeb-DF-v2(w/o Correction) | 91.414 | 91.404 | 91.831 | 91.391
Ours-A-Celeb-DF-v2 (Correction) | 93.535 | 93.655 | 94.087 | 93.525
Table 14: Evaluation results to demonstrate the effectiveness of human correction strategy.
Figure A.3: Examples of face images where annotations in A-FF++ and A-DFDC are inconsistent with Ours-FF++ (w/o Correction) and Ours-DFDC (w/o Correction).
Figure A.4: Uncertainty Score histograms for gender, age and race for A-FF++ [2] dataset.
Figure A.5: Uncertainty Score histograms for gender, age and race for A-DFDC [39] dataset.
Figure A.6: Uncertainty Score histograms for gender, age and race for A-DFD [40] dataset.
Figure A.7: Uncertainty Score histograms for gender, age and race for A-Celeb-DF-v2 [41] dataset.
Figure A.8: Uncertainty Score histograms for gender, age and race for AttGAN [49] dataset.
Figure A.9: Uncertainty Score histograms for gender, age and race for MMDGAN [50] dataset.
Figure A.10: Uncertainty Score histograms for gender, age and race for StarGAN [49] dataset.
Figure A.11: Uncertainty Score histograms for gender, age and race for StyleGAN [49] dataset.
Figure A.12: Uncertainty Score histograms for gender, age and race for StyleGAN2 [51] dataset.
Figure A.13: Uncertainty Score histograms for gender, age and race for StyleGAN3 [52] dataset.
Figure A.14: Uncertainty Score histograms for gender, age and race for MSG-StyleGAN [50] dataset.
Figure A.15: Uncertainty Score histograms for gender, age and race for ProGAN [53] dataset.
Figure A.16: Uncertainty Score histograms for gender, age and race for STGAN [50] dataset.
Figure A.17: Uncertainty Score histograms for gender, age and race for VQGAN [54] dataset.
Figure A.18: Uncertainty Score histograms for gender, age and race for DALLE2 [55], IF [55], Midjourney [55] dataset.
Figure A.19: Uncertainty Score histograms for gender, age and race for DCFace [56] dataset.
Figure A.20: Uncertainty Score histograms for gender, age and race for Latent Diffusion [57] dataset.
Figure A.21: Uncertainty Score histograms for gender, age and race for Palette [58] dataset.
Figure A.22: Uncertainty Score histograms for gender, age and race for SD v1.5 [59] dataset.
Figure A.23: Uncertainty Score histograms for gender, age and race for SD Inpainting [59] dataset.
Figure A.24: Uncertainty Score histograms for gender, age and race for FFHQ [6] dataset.
Figure A.25: Uncertainty Score histograms for gender, age and race for CASIA-WebFace [37] dataset.
Figure A.26: Uncertainty Score histograms for gender, age and race for IMDB-WIKI [38] dataset.
Figure A.27: Uncertainty Score histograms for gender, age and race for CelebA [36] dataset.
Figure A.28: Uncertainty Score histograms for gender, age and race for A-FF++ [2] real dataset.
Figure A.29: Uncertainty Score histograms for gender, age and race for A-DFDC [39] real dataset.
Figure A.30: Uncertainty Score histograms for gender, age and race for A-DFD [40] real dataset.
Figure A.31: Uncertainty Score histograms for gender, age and race for A-Celeb-DF-v2 [41] real dataset.

Appendix B Fairness Benchmark

B.1 Details of Detection Methods

| Model Type | Detector | Backbone | GitHub Link | Venue |
|---|---|---|---|---|
| Naive | Xception [61] | Xception | https://github.com/ondyari/FaceForensics/blob/master | ICCV-2019 |
| | EfficientB4 [62] | EfficientNet | https://github.com/lukemelas/EfficientNet-PyTorch | ICML-2019 |
| | ViT-B/16 [63] | Transformer | https://github.com/lucidrains/vit-pytorch | ICLR-2021 |
| Spatial | UCF [16] | Xception | https://github.com/SCLBD/DeepfakeBench/tree/main | ICCV-2023 |
| | UnivFD [67] | CLIP ViT | https://github.com/Yuheng-Li/UniversalFakeDetect | CVPR-2023 |
| | CORE [68] | Xception | https://github.com/niyunsheng/CORE | CVPRW-2022 |
| Frequency | F3Net [64] | Xception | https://github.com/yyk-wew/F3Net | ECCV-2020 |
| | SRM [66] | Xception | https://github.com/SCLBD/DeepfakeBench/tree/main | CVPR-2021 |
| | SPSL [65] | Xception | https://github.com/SCLBD/DeepfakeBench/tree/main | CVPR-2021 |
| Fairness-enhanced | DAW-FDD [20] | Xception | Unpublished code, reproduced by us | WACV-2024 |
| | DAG-FDD [20] | Xception | Unpublished code, reproduced by us | WACV-2024 |
| | PG-FDD [21] | Xception | https://github.com/Purdue-M2/Fairness-Generalization | CVPR-2024 |
Table 15: Summary of the implemented detectors in our fairness benchmark.

Xception [61]: a deep convolutional neural network (CNN) architecture that relies on depthwise separable convolutions. This approach significantly reduces the number of parameters and the computational cost while maintaining high performance. Xception serves as a classic backbone in deepfake detectors.

EfficientB4 [62]: a member of the EfficientNet family [62], which uses a model scaling method that uniformly scales depth, width, and resolution with a single compound coefficient. EfficientNet also serves as a classic backbone in deepfake detectors.

ViT-B/16 [63]: a model that applies the transformer architecture to images; 'B' denotes the base model size, and '16' indicates the patch size. ViT-B/16 splits each image into patches of 16×16 pixels, linearly embeds each patch, adds positional embeddings, and feeds the resulting sequence of vectors into a standard transformer encoder.
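As a rough illustration of the patch-splitting step only, the sketch below cuts a single-channel image into non-overlapping 16×16 tiles using plain Python lists. The 224×224 input size and the helper name `patchify` are assumptions made for this example (the input resolution is not stated here), and an actual ViT would operate on tensors and follow this step with a learned linear projection.

```python
from typing import List


def patchify(image: List[List[float]], patch: int) -> List[List[float]]:
    """Split an H x W single-channel image into flattened patch x patch tiles.

    Tiles are returned in row-major order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one tile into a single vector of patch * patch values.
            tokens.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return tokens


# Hypothetical 224 x 224 input whose pixel value encodes its position.
image = [[float(r * 224 + c) for c in range(224)] for r in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 256
```

Under these assumptions the encoder would receive a sequence of 196 patch vectors, each holding 256 values per channel before the linear embedding.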

F3Net [64]: utilizes a cross-attention two-stream network to effectively identify frequency-aware clues by integrating two branches: FAD and LFS. The FAD (Frequency-aware Decomposition) module divides the input image into various frequency bands using learnable partitions, representing the image with frequency-aware components to detect forgery patterns through this decomposition. Meanwhile, the LFS (Localized Frequency Statistics) module captures local frequency statistics to highlight statistical differences between authentic and counterfeit faces.

SPSL [65]: integrates spatial image data with the phase spectrum to detect up-sampling artifacts in face forgeries, enhancing the model’s generalization ability for face forgery detection. The paper provides a theoretical analysis of the effectiveness of using the phase spectrum. Additionally, it highlights that local texture information is more important than high-level semantic information for accurately detecting face forgeries.

SRM [66]: extracts high-frequency noise features and combines two different representations from the RGB and frequency domains to enhance the model’s generalization ability for face forgery detection.

UCF [16]: presents a multi-task disentanglement framework designed to tackle two key challenges in deepfake detection: overfitting to irrelevant features and overfitting to method-specific textures. By identifying and leveraging common features, this framework aims to improve the model’s generalization ability.

UnivFD [67]: uses the frozen CLIP ViT-L/14 [44] as a feature extractor and trains only the last linear layer to classify fake and real images.

CORE [68]: explicitly enforces the consistency of different representations. It first captures various representations through different augmentations and then regularizes the cosine distance between these representations to enhance their consistency.

DAW-FDD [20]: a demographic-aware Fair Deepfake Detection (DAW-FDD) method that leverages demographic information and employs an existing fairness risk measure [82]. At a high level, DAW-FDD aims to ensure that the losses achieved by different user-specified groups of interest (e.g., different races or genders) are similar to each other (so that the AI face detector is not more accurate on one group than on another) and, moreover, that the losses across all groups are low. Specifically, DAW-FDD uses a CVaR [83, 84] loss function across groups (to address imbalance in demographic groups) and, within each group, another CVaR loss function (to address imbalance between real and AI-generated training examples).
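The nested CVaR idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DAW-FDD implementation: the function names `cvar` and `nested_cvar`, the levels `alpha_inner` and `alpha_outer`, and the toy per-sample losses are all hypothetical. It uses the standard variational form of empirical CVaR, $\min_{\lambda}\,\lambda + \frac{1}{\alpha n}\sum_i [\ell_i - \lambda]_+$, which equals the mean of the largest $\alpha$-fraction of losses when $\alpha n$ is an integer.

```python
from typing import Dict, List, Sequence


def cvar(losses: Sequence[float], alpha: float) -> float:
    """Empirical CVaR at level alpha in (0, 1].

    Minimizes lam + sum(max(l - lam, 0)) / (alpha * n) over lam. The objective
    is convex and piecewise linear with breakpoints at the loss values, so it
    is enough to evaluate it at each observed loss.
    """
    if not losses or not 0.0 < alpha <= 1.0:
        raise ValueError("need at least one loss and alpha in (0, 1]")
    n = len(losses)

    def objective(lam: float) -> float:
        return lam + sum(max(l - lam, 0.0) for l in losses) / (alpha * n)

    return min(objective(lam) for lam in losses)


def nested_cvar(
    group_losses: Dict[str, List[float]], alpha_inner: float, alpha_outer: float
) -> float:
    """Inner CVaR over the samples of each group, outer CVaR over the groups."""
    per_group = [cvar(losses, alpha_inner) for losses in group_losses.values()]
    return cvar(per_group, alpha_outer)


# Toy per-sample losses for two hypothetical demographic groups.
toy = {"group_a": [0.1, 0.2, 0.3, 0.4], "group_b": [0.5, 1.5]}
print(cvar([1.0, 2.0, 3.0, 4.0], 0.5))   # mean of the two largest losses
print(nested_cvar(toy, 0.5, 0.5))        # dominated by the worse-off group
```

With `alpha = 1` the function reduces to the ordinary mean loss, so smaller levels shift the objective toward the hardest samples and the worst-performing groups.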

DAG-FDD [20]: a demographic-agnostic Fair Deepfake Detection (DAG-FDD) method based on distributionally robust optimization (DRO) [85, 86]. To use DAG-FDD, the user does not have to specify which attributes (such as race and gender) to treat as sensitive; the user only needs to specify a probability threshold for a minority group, without explicitly identifying all possible groups.

PG-FDD [21]: PG-FDD (Preserving Generalization Fair Deepfake Detection) employs disentanglement learning to extract demographic and domain-agnostic forgery features, promoting fair learning across a flattened loss landscape. Its framework combines disentanglement learning, fairness learning, and optimization modules. The disentanglement module introduces a loss to expose demographic and domain-agnostic features that enhance fairness generalization. The fairness learning module combines these features to promote fair learning, guided by generalization principles. The optimization module flattens the loss landscape, helping the model escape suboptimal solutions and strengthen fairness generalization.

B.2 Implementation Details

For the fairness benchmark, all experiments are implemented in PyTorch and run on a single NVIDIA RTX A6000 GPU. During training, we use the SGD optimizer with a learning rate of 0.0005, momentum of 0.9, and weight decay of 0.005. The batch size is set to 128 for most detectors; for SRM [66], UCF [16], and PG-FDD [21], it is reduced to 32 due to GPU memory constraints. For detector-specific hyperparameters, we use the default values from the original papers. All detectors are initialized with their official pre-trained weights and trained for 5 epochs.

B.3 Fairness Metrics

We assume a test set comprising indices $\{1, \ldots, n\}$. $Y_j$ and $\hat{Y}_j$ respectively represent the true and predicted labels of the sample $X_j$. Their values are binary, where 0 means real and 1 means fake. For all fairness metrics, a lower value means better performance.

$$F_{EO} := \sum_{\mathcal{J}_j \in \mathcal{J}} \sum_{q=0}^{1} \left| \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = 1,\, D_j = \mathcal{J}_j,\, Y_j = q]}}{\sum_{j=1}^{n} \mathbb{I}_{[D_j = \mathcal{J}_j,\, Y_j = q]}} - \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = 1,\, Y_j = q]}}{\sum_{j=1}^{n} \mathbb{I}_{[Y_j = q]}} \right|,$$

$$F_{OAE} := \max_{\mathcal{J}_j \in \mathcal{J}} \left\{ \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = Y_j,\, D_j = \mathcal{J}_j]}}{\sum_{j=1}^{n} \mathbb{I}_{[D_j = \mathcal{J}_j]}} - \min_{\mathcal{J}_j' \in \mathcal{J}} \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = Y_j,\, D_j = \mathcal{J}_j']}}{\sum_{j=1}^{n} \mathbb{I}_{[D_j = \mathcal{J}_j']}} \right\},$$

$$F_{DP} := \max_{q \in \{0,1\}} \left\{ \max_{\mathcal{J}_j \in \mathcal{J}} \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = q,\, D_j = \mathcal{J}_j]}}{\sum_{j=1}^{n} \mathbb{I}_{[D_j = \mathcal{J}_j]}} - \min_{\mathcal{J}_j' \in \mathcal{J}} \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = q,\, D_j = \mathcal{J}_j']}}{\sum_{j=1}^{n} \mathbb{I}_{[D_j = \mathcal{J}_j']}} \right\},$$

$$F_{MEO} := \max_{q, q' \in \{0,1\}} \left\{ \max_{\mathcal{J}_j \in \mathcal{J}} \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = q,\, Y_j = q',\, D_j = \mathcal{J}_j]}}{\sum_{j=1}^{n} \mathbb{I}_{[D_j = \mathcal{J}_j,\, Y_j = q]}} - \min_{\mathcal{J}_j' \in \mathcal{J}} \frac{\sum_{j=1}^{n} \mathbb{I}_{[\hat{Y}_j = q,\, Y_j = q',\, D_j = \mathcal{J}_j']}}{\sum_{j=1}^{n} \mathbb{I}_{[D_j = \mathcal{J}_j',\, Y_j = q]}} \right\},$$

$$F_{IND} := \sum_{j=1}^{n} \sum_{l=j+1}^{n} \mathbb{I}_{[\, |f(X_j) - f(X_l)| - \delta \|X_j - X_l\| \,]},$$

$$\text{Avg-}F_{R} := \frac{1}{|F|} \sum_{f \in F} R_{m,f}, \quad \text{Avg-}F_{MR} := \frac{1}{|M_t|} \sum_{m \in M_t} \text{Avg-}F_{R}.$$

Here, $D$ is the demographic variable and $\mathcal{J}$ is the set of subgroups, with each subgroup $\mathcal{J}_j \in \mathcal{J}$. $M$ is the set of detection models and $F$ is the set of fairness metrics. $R_{m,f}$ is the rank of detection model $m \in M$ for fairness metric $f \in F$, and $|F|$ is the total number of fairness metrics. $T$ is the set of model types, $M_t$ is the set of detection models within model type $t \in T$, and $|M_t|$ is the total number of detection models within model type $t$. $F_{EO}$ measures the disparity in TPR and FPR between each subgroup and the overall population. $F_{OAE}$ measures the maximum ACC gap across all demographic groups. $F_{DP}$ measures the maximum difference in prediction rates across all demographic groups. $F_{MEO}$ captures the largest disparity in prediction outcomes (either positive or negative) when comparing different demographic groups. $\delta$ in $F_{IND}$ is a predefined scale factor (0.06 in our experiments), and $f(X_j)$ represents the predicted logits of the model for input sample $X_j$. $F_{IND}$ reflects the principle that a model is fair across individuals if similar individuals receive similar predicted outcomes. $\text{Avg-}F_{R}$ is the average fairness rank of detection model $m$, and $\text{Avg-}F_{MR}$ is the average fairness rank of a model type.
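To make the group-level definitions concrete, the following sketch computes $F_{EO}$, $F_{OAE}$, $F_{DP}$, $F_{MEO}$ and the average fairness rank on toy data using only the Python standard library. It is an illustration under assumptions rather than the benchmark code: the function names, the toy labels, the treatment of empty groups (rate 0), and the tie rule in the ranking are hypothetical. In particular, $F_{MEO}$ is computed here with each rate conditioned on the true label $q'$, which is one reading of the definition, and values are returned as fractions rather than percentages.

```python
from typing import Dict, Hashable, List, Sequence

_ANY = object()  # sentinel meaning "do not condition on this variable"


def _rate(hits: int, total: int) -> float:
    # Assumption: an empty conditioning set contributes a rate of 0.
    return hits / total if total else 0.0


def group_fairness(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Dict[str, float]:
    """Return F_EO, F_OAE, F_DP and F_MEO as fractions (lower is fairer)."""
    if not len(y_true) == len(y_pred) == len(groups):
        raise ValueError("inputs must have the same length")
    idx = range(len(y_true))
    subgroups = list(dict.fromkeys(groups))  # unique labels, input order

    def pred_rate(q: int, true=_ANY, group=_ANY) -> float:
        """Share of samples predicted as q among those matching the conditions."""
        sel = [
            j for j in idx
            if (true is _ANY or y_true[j] == true)
            and (group is _ANY or groups[j] == group)
        ]
        return _rate(sum(y_pred[j] == q for j in sel), len(sel))

    def accuracy(group: Hashable) -> float:
        sel = [j for j in idx if groups[j] == group]
        return _rate(sum(y_pred[j] == y_true[j] for j in sel), len(sel))

    def gap(values: List[float]) -> float:
        return max(values) - min(values)

    return {
        # Subgroup FPR (q=0) and TPR (q=1) versus the overall population.
        "F_EO": sum(
            abs(pred_rate(1, q, g) - pred_rate(1, q)) for g in subgroups for q in (0, 1)
        ),
        "F_OAE": gap([accuracy(g) for g in subgroups]),
        "F_DP": max(gap([pred_rate(q, group=g) for g in subgroups]) for q in (0, 1)),
        # Assumption: each rate is conditioned on the true label qp.
        "F_MEO": max(
            gap([pred_rate(q, qp, g) for g in subgroups])
            for q in (0, 1) for qp in (0, 1)
        ),
    }


def average_fairness_rank(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Avg-F_R: rank models per metric (1 = lowest value), then average the ranks.

    Assumption: tied models share the better rank.
    """
    metrics = list(next(iter(scores.values())))
    return {
        m: sum(1 + sum(scores[o][f] < scores[m][f] for o in scores) for f in metrics)
        / len(metrics)
        for m in scores
    }


# Toy example: group "a" is classified perfectly, group "b" is not.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a"] * 4 + ["b"] * 4
result = group_fairness(y_true, y_pred, groups)
print(result)

toy_scores = {"m1": {"F_EO": 0.2, "F_DP": 0.3}, "m2": {"F_EO": 0.1, "F_DP": 0.4}}
print(average_fairness_rank(toy_scores))
```

In the toy example both groups receive the same share of "fake" predictions, so $F_{DP}$ is zero even though accuracy differs by 0.5 between the groups, which shows why the metrics are reported side by side.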

B.4 Full Subsets Evaluation Results

This section presents the detailed test results for each subset, shown in Tables 16 through 35. The findings align with the results reported in Fig. 4.

| Dataset | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| FF++ | Gender | $F_{MEO}$ | 4.353 | 3.346 | 1.161 | 2.595 | 0.887 | 2.492 | 4.916 | 10.873 | 2.516 | 12.198 | 1.606 | 2.214 |
| | | $F_{DP}$ | 1.250 | 1.096 | 0.276 | 0.601 | 0.392 | 0.409 | 1.231 | 2.874 | 0.61 | 2.722 | 1.024 | 0.772 |
| | | $F_{OAE}$ | 0.177 | 0.132 | 0.396 | 0.426 | 0.231 | 0.228 | 0.489 | 0.015 | 0.196 | 0.977 | 0.941 | 0.095 |
| | | $F_{EO}$ | 4.749 | 4.293 | 1.335 | 2.728 | 1.012 | 2.839 | 5.117 | 12.323 | 3.221 | 12.993 | 2.969 | 2.362 |
| | Race | $F_{MEO}$ | 10.304 | 9.813 | 7.630 | 15.051 | 6.844 | 22.26 | 9.791 | 23.588 | 4.564 | 22.598 | 2.607 | 15.657 |
| | | $F_{DP}$ | 3.562 | 9.544 | 3.485 | 3.22 | 5.864 | 3.516 | 5.554 | 12.934 | 8.75 | 8.954 | 6.65 | 2.943 |
| | | $F_{OAE}$ | 4.465 | 7.396 | 6.232 | 4.045 | 2.541 | 5.227 | 4.388 | 6.889 | 3.522 | 7.382 | 1.939 | 3.764 |
| | | $F_{EO}$ | 17.066 | 27.404 | 12.835 | 20.586 | 11.277 | 36.221 | 17.288 | 70.499 | 11.944 | 48.644 | 6.882 | 18.386 |
| | Age | $F_{MEO}$ | 9.851 | 5.348 | 5.984 | 6.204 | 3.622 | 14.005 | 9.692 | 15.205 | 9.423 | 24.413 | 1.857 | 6.136 |
| | | $F_{DP}$ | 2.887 | 4.708 | 1.280 | 5.661 | 6.479 | 7.919 | 6.196 | 9.205 | 7.693 | 4.346 | 6.221 | 4.339 |
| | | $F_{OAE}$ | 1.038 | 5.813 | 6.417 | 2.049 | 0.856 | 1.581 | 2.606 | 8.927 | 1.138 | 4.263 | 1.472 | 2.112 |
| | | $F_{EO}$ | 18.191 | 11.876 | 8.199 | 13.665 | 5.636 | 20.696 | 17.291 | 29.607 | 14.781 | 47.613 | 6.419 | 12.446 |
| | Intersection | $F_{MEO}$ | 28.949 | 16.662 | 11.994 | 18.672 | 8.505 | 30.828 | 19.132 | 54.201 | 8.784 | 39.858 | 5.130 | 16.994 |
| | | $F_{DP}$ | 11.648 | 12.215 | 5.127 | 6.721 | 10.157 | 4.449 | 11.268 | 32.584 | 14.697 | 20.864 | 10.087 | 4.831 |
| | | $F_{OAE}$ | 8.442 | 10.876 | 10.295 | 7.210 | 4.868 | 8.742 | 5.638 | 15.415 | 8.209 | 10.843 | 3.322 | 4.491 |
| | | $F_{EO}$ | 70.162 | 68.005 | 32.625 | 48.53 | 25.922 | 78.296 | 40.971 | 169.535 | 33.428 | 131.755 | 19.399 | 38.887 |
| | - | ACC | 92.280 | 89.282 | 86.051 | 94.832 | 93.676 | 92.587 | 94.982 | 83.652 | 95.183 | 91.420 | 93.254 | 96.237 |
| | | AUC | 95.605 | 91.281 | 83.542 | 97.878 | 97.820 | 96.164 | 98.115 | 76.839 | 98.147 | 94.618 | 97.996 | 98.245 |
| | | AP | 99.207 | 98.381 | 96.712 | 99.631 | 99.619 | 99.29 | 99.668 | 95.321 | 99.684 | 99.011 | 99.658 | 99.681 |
| | | EER | 10.951 | 16.807 | 24.299 | 6.756 | 6.565 | 9.888 | 6.02 | 30.755 | 5.993 | 12.449 | 6.429 | 7.273 |
Table 16: Detailed fairness and utility evaluation results on FF++.
| Dataset | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| DFDC | Gender | $F_{MEO}$ | 6.011 | 3.708 | 5.567 | 4.415 | 2.357 | 8.711 | 1.492 | 7.444 | 2.062 | 4.87 | 2.944 | 3.687 |
| | | $F_{DP}$ | 5.878 | 0.998 | 3.43 | 4.959 | 3.829 | 8.35 | 3.776 | 5.039 | 3.662 | 5.271 | 4.468 | 6.348 |
| | | $F_{OAE}$ | 2.97 | 2.744 | 1.88 | 2.645 | 2.537 | 2.438 | 1.427 | 0.075 | 2.178 | 3.012 | 2.233 | 0.869 |
| | | $F_{EO}$ | 6.222 | 5.564 | 7.742 | 5.609 | 3.833 | 10.841 | 2.517 | 12.51 | 3.784 | 4.898 | 3.95 | 4.968 |
| | Race | $F_{MEO}$ | 8.525 | 6.846 | 24.319 | 7.667 | 9.726 | 9.139 | 11.603 | 22.342 | 10.74 | 15.529 | 11.403 | 5.992 |
| | | $F_{DP}$ | 21.619 | 11.534 | 20.596 | 25.03 | 24.594 | 21.463 | 25.9 | 24.317 | 26.634 | 23.997 | 25.534 | 24.613 |
| | | $F_{OAE}$ | 1.622 | 3.701 | 12.756 | 3.048 | 2.816 | 5.793 | 4.722 | 11.051 | 2.699 | 14.659 | 3.46 | 2.523 |
| | | $F_{EO}$ | 26.728 | 15.611 | 47.784 | 25.2 | 25.744 | 24.09 | 22.014 | 65.788 | 23.784 | 47.679 | 26.64 | 12.268 |
| | Age | $F_{MEO}$ | 6.193 | 7.721 | 17.868 | 5.022 | 7.375 | 10.382 | 4.608 | 13.078 | 5.683 | 20.119 | 5.578 | 3.96 |
| | | $F_{DP}$ | 11.068 | 4.752 | 14.277 | 9.967 | 12.117 | 8.172 | 11.987 | 9.764 | 9.112 | 13.229 | 10.702 | 11.48 |
| | | $F_{OAE}$ | 2.817 | 5.951 | 4.984 | 3.918 | 2.585 | 6.092 | 2.513 | 7.523 | 3.869 | 12.581 | 2.498 | 1.653 |
| | | $F_{EO}$ | 14.397 | 16.233 | 26.396 | 14.327 | 13.88 | 16.625 | 12.03 | 22.816 | 11.274 | 31.018 | 8.736 | 6.954 |
| | Intersection | $F_{MEO}$ | 14.479 | 15.029 | 33.979 | 14.067 | 24.924 | 14.117 | 16.119 | 38.533 | 17.421 | 20.447 | 18.268 | 10.973 |
| | | $F_{DP}$ | 28.877 | 17.816 | 30.153 | 32.117 | 31.493 | 27.666 | 31.604 | 28.815 | 33.791 | 27.812 | 30.224 | 31.389 |
| | | $F_{OAE}$ | 5.619 | 8.088 | 20.771 | 4.456 | 7.453 | 9.306 | 5.922 | 14.994 | 4.423 | 18.877 | 5.642 | 3.832 |
| | | $F_{EO}$ | 72.695 | 60.893 | 111.03 | 59.238 | 67.19 | 60.749 | 63.262 | 133.283 | 58.174 | 90.761 | 64.155 | 33.495 |
| | - | ACC | 81.223 | 71.939 | 71.044 | 87.658 | 87.482 | 83.536 | 89.155 | 64.164 | 88.75 | 81.452 | 88.867 | 92.905 |
| | | AUC | 90.395 | 80.17 | 81.942 | 95.158 | 95.789 | 91.837 | 96.025 | 72.228 | 95.65 | 91.695 | 95.916 | 97.014 |
| | | AP | 91.284 | 81.442 | 82.547 | 95.764 | 96.313 | 92.435 | 96.567 | 75.304 | 96.219 | 92.37 | 96.447 | 97.081 |
| | | EER | 18.443 | 28.133 | 26.271 | 12.367 | 10.805 | 15.588 | 10.818 | 33.542 | 10.927 | 17.043 | 10.709 | 8.317 |
Table 17: Detailed fairness and utility evaluation results on DFDC.
Dataset: DFD. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 7.052 | 2.139 | 3.857 | 6.07 | 0.269 | 10.037 | 6.095 | 1.039 | 2.257 | 3.605 | 3.059 | 0.717 |
| Gender | $F_{DP}$ | 5.543 | 5.864 | 1.231 | 4.871 | 7.154 | 2.188 | 5.342 | 2.327 | 7.261 | 6.893 | 8.593 | 6.827 |
| Gender | $F_{OAE}$ | 3.445 | 3.199 | 5.624 | 2.212 | 0.232 | 4.177 | 2.313 | 5.785 | 2.198 | 2.996 | 2.609 | 1.241 |
| Gender | $F_{EO}$ | 8.657 | 3.812 | 3.868 | 6.325 | 0.467 | 10.381 | 6.731 | 1.300 | 4.139 | 5.855 | 3.900 | 1.326 |
| Race | $F_{MEO}$ | 5.975 | 6.844 | 12.306 | 5.574 | 0.319 | 18.91 | 6.141 | 20.641 | 6.292 | 10.597 | 6.021 | 11.641 |
| Race | $F_{DP}$ | 25.863 | 19.116 | 11.976 | 25.678 | 40.64 | 21.081 | 28.174 | 14.104 | 26.949 | 28.784 | 28.439 | 29.743 |
| Race | $F_{OAE}$ | 6.002 | 10.714 | 16.754 | 4.602 | 0.206 | 8.565 | 3.89 | 15.842 | 4.917 | 4.594 | 4.125 | 3.797 |
| Race | $F_{EO}$ | 16.002 | 17.914 | 24.819 | 14.628 | 0.884 | 32.788 | 15.477 | 51.098 | 17.872 | 19.959 | 13.855 | 17.467 |
| Age | $F_{MEO}$ | 14.485 | 13.629 | 2.744 | 9.24 | 0.9 | 9.38 | 10.383 | 6.69 | 10.768 | 10.107 | 10.942 | 5.6 |
| Age | $F_{DP}$ | 34.386 | 18.578 | 10.063 | 32.355 | 20.119 | 18.826 | 33.553 | 4.892 | 34.865 | 32.253 | 34.503 | 27.165 |
| Age | $F_{OAE}$ | 11.001 | 18.255 | 22.847 | 6.943 | 0.434 | 13.797 | 7.41 | 23.7 | 5.315 | 7.635 | 6.256 | 6.095 |
| Age | $F_{EO}$ | 22.487 | 33.616 | 6.473 | 15.786 | 1.97 | 13.896 | 18.326 | 12.859 | 16.44 | 14.035 | 13.272 | 8.349 |
| Intersection | $F_{MEO}$ | 15.691 | 37.9 | 20.833 | 13.62 | 1.786 | 27.246 | 11.053 | 35.828 | 18.056 | 20.833 | 9.157 | 12.903 |
| Intersection | $F_{DP}$ | 35.824 | 31.581 | 18.295 | 36.56 | 53.771 | 29.097 | 38.828 | 28.054 | 38.536 | 41.172 | 39.027 | 41.388 |
| Intersection | $F_{OAE}$ | 9.913 | 15.939 | 21.972 | 6.863 | 1.322 | 11.216 | 6.327 | 22.706 | 7.158 | 6.101 | 6.097 | 5.31 |
| Intersection | $F_{EO}$ | 46.408 | 79.825 | 68.93 | 42.743 | 7.073 | 91.155 | 41.325 | 111.273 | 53.592 | 49.779 | 40.678 | 41.822 |
| - | ACC | 93.039 | 88.321 | 83.862 | 94.6 | 99.505 | 91.405 | 94.984 | 80.753 | 94.761 | 92.99 | 94.6 | 97.102 |
| - | AUC | 97.507 | 93.914 | 89.886 | 98.478 | 99.942 | 96.347 | 98.592 | 82.817 | 98.651 | 97.659 | 98.813 | 99.082 |
| - | AP | 99.349 | 98.366 | 97.059 | 99.596 | 99.965 | 98.929 | 99.614 | 95.008 | 99.62 | 99.375 | 99.687 | 99.75 |
| - | EER | 8.086 | 13.377 | 18.014 | 6.183 | 0.500 | 10.048 | 6.124 | 24.911 | 5.945 | 7.788 | 5.529 | 5.470 |

Table 18: Detailed fairness and utility evaluation results on DFD.
Dataset: Celeb-DF-v2. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 1.764 | 8.434 | 10.889 | 0.584 | 2.701 | 2.377 | 2.645 | 13.511 | 2.78 | 1.312 | 0.997 | 1.584 |
| Gender | $F_{DP}$ | 6.072 | 7.227 | 0.405 | 6.219 | 8.541 | 7.023 | 7 | 9.706 | 6.693 | 6.071 | 7.104 | 6.023 |
| Gender | $F_{OAE}$ | 1.578 | 2.063 | 5.663 | 1.092 | 2.149 | 0.976 | 0.884 | 8.053 | 0.636 | 2.411 | 0.831 | 0.429 |
| Gender | $F_{EO}$ | 2.585 | 10.238 | 11.073 | 1.108 | 3.236 | 3.379 | 3.564 | 20.484 | 3.369 | 1.599 | 1.519 | 1.601 |
| Race | $F_{MEO}$ | 5.583 | 9.879 | 14.539 | 7.288 | 8.943 | 9.753 | 4.16 | 32.306 | 4.502 | 21.999 | 8.275 | 7.45 |
| Race | $F_{DP}$ | 19.474 | 19.627 | 14.411 | 21.812 | 24.643 | 16.882 | 20.953 | 12.222 | 21.337 | 22.694 | 24.787 | 16.744 |
| Race | $F_{OAE}$ | 6.569 | 9.664 | 12.618 | 4.032 | 6.035 | 5.493 | 3.524 | 10.82 | 3.813 | 3.092 | 5.392 | 2.815 |
| Race | $F_{EO}$ | 10.691 | 25.652 | 28.759 | 13.013 | 11.726 | 15.42 | 14.684 | 63.524 | 9.671 | 58.225 | 14.714 | 12.384 |
| Age | $F_{MEO}$ | 7.172 | 7.331 | 15.248 | 6.974 | 1.948 | 8.784 | 3.873 | 29.904 | 3.539 | 5.903 | 2.508 | 5.968 |
| Age | $F_{DP}$ | 33.004 | 25.16 | 6.737 | 33.891 | 33.072 | 33.648 | 32.236 | 18.794 | 32.986 | 24.932 | 34.577 | 32.264 |
| Age | $F_{OAE}$ | 1.925 | 8.576 | 26.359 | 1.628 | 1.532 | 1.149 | 2.502 | 12.526 | 3.482 | 10.577 | 0.845 | 1.183 |
| Age | $F_{EO}$ | 11.497 | 14.073 | 19.966 | 11.657 | 5.013 | 11.178 | 7.669 | 53.72 | 9.685 | 10.027 | 5.404 | 7.037 |
| Intersection | $F_{MEO}$ | 20 | 32.79 | 57.779 | 14.286 | 28.571 | 16.19 | 14.286 | 58.368 | 16.774 | 25.477 | 14.286 | 12.381 |
| Intersection | $F_{DP}$ | 76.368 | 78.595 | 67.795 | 76.672 | 77.839 | 77.371 | 76.863 | 67.188 | 76.881 | 77.935 | 75.761 | 77.349 |
| Intersection | $F_{OAE}$ | 19.231 | 16.228 | 49.562 | 11.538 | 8.463 | 7.334 | 7.692 | 29.689 | 11.538 | 5.769 | 7.692 | 5.769 |
| Intersection | $F_{EO}$ | 71.129 | 114.538 | 103.126 | 53.655 | 59.887 | 59.694 | 60.765 | 182.729 | 61.653 | 141.381 | 48.621 | 33.495 |
| - | ACC | 97.43 | 95.129 | 91.548 | 98.145 | 97.511 | 98.073 | 98.263 | 88.191 | 98.221 | 96.073 | 98.405 | 98.754 |
| - | AUC | 99.345 | 97.548 | 96.504 | 99.652 | 99.579 | 99.448 | 99.684 | 83.086 | 99.685 | 98.377 | 99.702 | 99.815 |
| - | AP | 99.908 | 99.641 | 99.492 | 99.953 | 99.943 | 99.923 | 99.957 | 97.068 | 99.957 | 99.763 | 99.96 | 99.974 |
| - | EER | 3.733 | 8.041 | 9.747 | 2.189 | 2.074 | 2.857 | 2.051 | 25.184 | 2.143 | 6.382 | 1.636 | 2.281 |

Table 19: Detailed fairness and utility evaluation results on Celeb-DF-v2.
Dataset: AttGAN. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 0.56 | 1.669 | 0.472 | 0.946 | 0.459 | 1.554 | 0.422 | 4.79 | 3.544 | 1.941 | 0.153 | 1.47 |
| Gender | $F_{DP}$ | 10.923 | 12.249 | 10.678 | 11.069 | 11.068 | 10.489 | 11.295 | 9.998 | 12.045 | 11.936 | 11.171 | 12.205 |
| Gender | $F_{OAE}$ | 0.619 | 0.287 | 0.739 | 1.053 | 0.432 | 1.136 | 0.085 | 3.38 | 0.75 | 0.721 | 0.165 | 0.288 |
| Gender | $F_{EO}$ | 1.096 | 3.05 | 0.816 | 1.695 | 0.617 | 2.078 | 0.676 | 6.049 | 3.62 | 2.379 | 0.177 | 2.275 |
| Race | $F_{MEO}$ | 3.39 | 3.613 | 3.228 | 3.198 | 3.918 | 4.013 | 3.03 | 18.643 | 17.655 | 7.576 | 1.887 | 1.695 |
| Race | $F_{DP}$ | 11.859 | 13.88 | 11.05 | 12.876 | 10.628 | 11.582 | 13.059 | 22.615 | 16.834 | 13.753 | 13.054 | 13.502 |
| Race | $F_{OAE}$ | 1.587 | 2.174 | 2.526 | 2.387 | 2.033 | 2.31 | 2.521 | 4.713 | 5.636 | 5.042 | 1.6 | 1.6 |
| Race | $F_{EO}$ | 5.016 | 10.539 | 9.994 | 5.975 | 9.472 | 9.239 | 5.592 | 37.89 | 19.269 | 12.917 | 4.291 | 5.117 |
| Age | $F_{MEO}$ | 3.086 | 4.899 | 7.144 | 1.194 | 3.704 | 3.096 | 2.469 | 15.211 | 5.996 | 2.855 | 2.206 | 5.493 |
| Age | $F_{DP}$ | 22.439 | 21.386 | 18.175 | 23.439 | 22.14 | 22.789 | 24.491 | 20.473 | 21.175 | 22.14 | 24.789 | 21.193 |
| Age | $F_{OAE}$ | 1.105 | 3.595 | 4.689 | 0.942 | 2.563 | 3.132 | 2.456 | 6.436 | 4.493 | 2.309 | 0.398 | 3.758 |
| Age | $F_{EO}$ | 5.209 | 10.255 | 13.136 | 4.103 | 10.371 | 10.312 | 5.807 | 36.092 | 8.639 | 6.932 | 3.746 | 7.407 |
| Intersection | $F_{MEO}$ | 5.128 | 11.111 | 11.111 | 7.692 | 7.407 | 6.667 | 7.407 | 31.774 | 33.333 | 7.692 | 3.125 | 5 |
| Intersection | $F_{DP}$ | 20.594 | 24.253 | 20.152 | 21.106 | 20.783 | 19.375 | 22.003 | 28.514 | 28.753 | 21.677 | 22.003 | 23.411 |
| Intersection | $F_{OAE}$ | 4.225 | 7.042 | 4.968 | 5.634 | 3.177 | 4.878 | 4.348 | 16.17 | 10.976 | 5.479 | 2.817 | 1.852 |
| Intersection | $F_{EO}$ | 21.471 | 42.762 | 35.107 | 23.6 | 25.943 | 31.389 | 22.546 | 92.264 | 46.215 | 33.053 | 12.897 | 17.18 |
| - | ACC | 98.482 | 97.884 | 95.86 | 98.62 | 98.482 | 98.666 | 98.712 | 80.957 | 96.274 | 97.608 | 99.264 | 99.126 |
| - | AUC | 99.798 | 99.526 | 99.259 | 99.776 | 99.702 | 99.642 | 99.875 | 89.719 | 98.721 | 99.722 | 99.781 | 99.953 |
| - | AP | 99.795 | 99.492 | 99.282 | 99.797 | 99.612 | 99.587 | 99.888 | 91.76 | 98.646 | 99.732 | 99.827 | 99.958 |
| - | EER | 1.594 | 2.092 | 4.084 | 1.494 | 1.494 | 1.394 | 1.195 | 18.426 | 4.98 | 2.39 | 0.996 | 1.494 |

Table 20: Detailed fairness and utility evaluation results on AttGAN.
Dataset: MMDGAN. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 2.722 | 17.144 | 0.773 | 1.087 | 3.809 | 3.261 | 4.622 | 1.626 | 3.394 | 8.439 | 11.417 | 2.448 |
| Gender | $F_{DP}$ | 8.077 | 16.007 | 6.925 | 7.801 | 8.077 | 7.939 | 9.604 | 6.787 | 9.584 | 11.082 | 13.928 | 9.141 |
| Gender | $F_{OAE}$ | 2.335 | 7.348 | 1.084 | 1.271 | 3.398 | 3.26 | 2.797 | 0.63 | 0.512 | 4.275 | 4.994 | 1.133 |
| Gender | $F_{EO}$ | 4.772 | 18.286 | 0.773 | 2.028 | 6.9 | 6.352 | 5.63 | 3.071 | 4.481 | 9.447 | 12.493 | 2.448 |
| Race | $F_{MEO}$ | 10 | 73.95 | 8.974 | 16.667 | 14.286 | 8.333 | 8.333 | 33.333 | 16.667 | 28.571 | 28.571 | 2.21 |
| Race | $F_{DP}$ | 33 | 55 | 32 | 33 | 38 | 33 | 33 | 33 | 33 | 43 | 43 | 33 |
| Race | $F_{OAE}$ | 9.091 | 22 | 5.233 | 9.091 | 5 | 4.545 | 4.545 | 18.182 | 9.091 | 10 | 10 | 1.187 |
| Race | $F_{EO}$ | 23.462 | 91.198 | 13.141 | 22.352 | 28.661 | 15.508 | 19.122 | 48.478 | 25.201 | 39.585 | 44.593 | 5.931 |
| Age | $F_{MEO}$ | 11.706 | 22.297 | 4.808 | 10.345 | 10.345 | 10.345 | 7.642 | 6.924 | 9.091 | 14.336 | 14.559 | 9.091 |
| Age | $F_{DP}$ | 17.703 | 20.303 | 10.909 | 9.394 | 11.515 | 11.212 | 11.818 | 10.303 | 12.727 | 8.788 | 12.727 | 11.818 |
| Age | $F_{OAE}$ | 3.939 | 10.606 | 2.424 | 5.455 | 9.091 | 6.515 | 4.127 | 3.828 | 2.233 | 6.89 | 8.254 | 5.263 |
| Age | $F_{EO}$ | 23.368 | 39.483 | 6.422 | 13.465 | 22.203 | 25.61 | 17.407 | 13.78 | 17.996 | 21.652 | 24.926 | 10.142 |
| Intersection | $F_{MEO}$ | 12.5 | 100 | 22.222 | 22.222 | 16.667 | 11.111 | 11.111 | 44.444 | 22.222 | 100 | 33.333 | 4.167 |
| Intersection | $F_{DP}$ | 58.088 | 62.5 | 51.471 | 58.088 | 58.088 | 58.088 | 58.088 | 58.088 | 58.088 | 70.588 | 58.088 | 58.088 |
| Intersection | $F_{OAE}$ | 11.765 | 41.667 | 12.5 | 11.765 | 8.333 | 5.882 | 5.882 | 23.529 | 11.765 | 12.5 | 16.667 | 1.948 |
| Intersection | $F_{EO}$ | 51.536 | 230.67 | 59.507 | 44.743 | 59.982 | 39.11 | 42.671 | 103.142 | 55.926 | 161.146 | 85.72 | 14.412 |
| - | ACC | 97.525 | 90.099 | 95.792 | 98.02 | 97.03 | 98.02 | 97.772 | 93.812 | 96.535 | 94.307 | 96.287 | 99.01 |
| - | AUC | 99.395 | 97.839 | 99.299 | 99.687 | 99.508 | 99.392 | 99.781 | 97.987 | 98.521 | 99.515 | 99.808 | 99.98 |
| - | AP | 98.918 | 97.589 | 99.226 | 99.691 | 99.461 | 99.215 | 99.792 | 97.182 | 98.25 | 99.525 | 99.812 | 99.983 |
| - | EER | 2.646 | 7.407 | 3.704 | 1.587 | 2.646 | 2.646 | 2.116 | 6.349 | 4.233 | 1.587 | 1.587 | 0.529 |

Table 21: Detailed fairness and utility evaluation results on MMDGAN.
Dataset: StarGAN. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 0.281 | 2.181 | 0.472 | 0.33 | 0.388 | 0.552 | 0.044 | 2.075 | 1.737 | 0.344 | 0.627 | 0.33 |
| Gender | $F_{DP}$ | 4.379 | 5.639 | 3.887 | 4.547 | 4.555 | 4.744 | 4.461 | 5.32 | 5.209 | 4.678 | 4.954 | 4.65 |
| Gender | $F_{OAE}$ | 0.181 | 0.604 | 0.435 | 0.3 | 0.275 | 0.332 | 0.008 | 0.095 | 0.214 | 0.07 | 0.107 | 0.197 |
| Gender | $F_{EO}$ | 0.408 | 2.679 | 0.59 | 0.486 | 0.561 | 0.619 | 0.049 | 2.74 | 2.18 | 0.606 | 0.957 | 0.375 |
| Race | $F_{MEO}$ | 4 | 4.577 | 11.031 | 4 | 8 | 4 | 8 | 22.727 | 11.197 | 6.113 | 4 | 2.062 |
| Race | $F_{DP}$ | 27.875 | 27.493 | 29.459 | 28.39 | 29.39 | 26.875 | 29.086 | 35.768 | 30.974 | 27.403 | 27.875 | 26.056 |
| Race | $F_{OAE}$ | 1.515 | 3.036 | 4.571 | 1.515 | 3.03 | 3.03 | 3.03 | 6.682 | 2.931 | 1.395 | 1.515 | 1.325 |
| Race | $F_{EO}$ | 7.666 | 11.827 | 17.243 | 8.507 | 10.863 | 8.762 | 10.006 | 29.26 | 17.857 | 12.09 | 6.291 | 4.109 |
| Age | $F_{MEO}$ | 2.479 | 3.577 | 5.091 | 3.167 | 1.667 | 2.5 | 1.379 | 1.167 | 4.562 | 2.033 | 1.667 | 0.943 |
| Age | $F_{DP}$ | 19.078 | 17.476 | 16.434 | 19.802 | 19.399 | 19.078 | 19.319 | 17.244 | 18.927 | 17.39 | 19.078 | 19.158 |
| Age | $F_{OAE}$ | 1.132 | 2.264 | 2.119 | 0.323 | 1.201 | 2.264 | 1.132 | 2.075 | 0.843 | 0.908 | 1.509 | 0.601 |
| Age | $F_{EO}$ | 4.659 | 6.124 | 9.058 | 5.728 | 4.038 | 5.801 | 2.946 | 2.421 | 6.082 | 4.539 | 3.86 | 2.745 |
| Intersection | $F_{MEO}$ | 14.286 | 12.5 | 21.774 | 6.25 | 11.111 | 6.25 | 14.286 | 25 | 18.75 | 11.111 | 5.556 | 2.381 |
| Intersection | $F_{DP}$ | 30.971 | 32.154 | 36.599 | 31.973 | 33.612 | 29.932 | 31.571 | 38.639 | 36.417 | 31.791 | 31.571 | 28.326 |
| Intersection | $F_{OAE}$ | 5.882 | 5.882 | 7.418 | 2.222 | 4.082 | 4.082 | 5.882 | 8.889 | 5.462 | 4.082 | 2.041 | 1.471 |
| Intersection | $F_{EO}$ | 22.432 | 38.089 | 41.688 | 20.567 | 19.704 | 19.605 | 25.484 | 59.844 | 36.081 | 26.652 | 11.756 | 8.426 |
| - | ACC | 99.326 | 98.289 | 96.216 | 99.015 | 99.274 | 99.378 | 99.43 | 94.66 | 96.319 | 98.237 | 99.482 | 99.533 |
| - | AUC | 99.874 | 99.773 | 99.556 | 99.909 | 99.86 | 99.869 | 99.964 | 99.626 | 99.076 | 99.796 | 99.909 | 99.983 |
| - | AP | 99.899 | 99.797 | 99.56 | 99.933 | 99.826 | 99.832 | 99.97 | 99.724 | 99.079 | 99.809 | 99.929 | 99.986 |
| - | EER | 0.795 | 1.135 | 2.611 | 0.454 | 0.795 | 0.681 | 0.568 | 1.93 | 3.973 | 1.589 | 0.568 | 0.454 |

Table 22: Detailed fairness and utility evaluation results on StarGAN.
Dataset: StyleGAN. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 0.436 | 2.543 | 1.046 | 1.136 | 1.558 | 0.561 | 0.447 | 3.248 | 3.136 | 0.44 | 0.789 | 0.136 |
| Gender | $F_{DP}$ | 20.675 | 21.927 | 20.009 | 20.923 | 21.864 | 20.013 | 20.847 | 20.617 | 21.168 | 20.521 | 21.311 | 21.15 |
| Gender | $F_{OAE}$ | 0.17 | 0.208 | 0.869 | 0.551 | 0.344 | 0.496 | 0.205 | 1.003 | 0.187 | 0.029 | 0.404 | 0.027 |
| Gender | $F_{EO}$ | 0.533 | 3.444 | 1.191 | 1.686 | 2.035 | 0.926 | 0.454 | 3.747 | 3.316 | 0.73 | 1.009 | 0.205 |
| Race | $F_{MEO}$ | 4.078 | 13.916 | 11.459 | 4.498 | 3.659 | 11.815 | 2.439 | 20.18 | 18.22 | 2.941 | 1.22 | 2.105 |
| Race | $F_{DP}$ | 25.95 | 25.845 | 24.097 | 24.671 | 24.566 | 27.593 | 24.207 | 27.273 | 25.481 | 23.849 | 24.053 | 24.257 |
| Race | $F_{OAE}$ | 1.149 | 5.309 | 4.74 | 1.905 | 1.693 | 4.72 | 0.607 | 8.377 | 7.544 | 0.892 | 0.635 | 1.075 |
| Race | $F_{EO}$ | 7.696 | 21.343 | 19.263 | 8.82 | 6.977 | 15.593 | 4.642 | 30.007 | 26.199 | 5.893 | 3.345 | 3.163 |
| Age | $F_{MEO}$ | 9.065 | 19.373 | 11.673 | 18.17 | 19.059 | 17.073 | 1.491 | 9.843 | 7.494 | 9.065 | 9.53 | 1.556 |
| Age | $F_{DP}$ | 49.291 | 52.488 | 44.166 | 48.832 | 50.785 | 49.475 | 47.41 | 40.723 | 44.455 | 49.356 | 49.553 | 48.55 |
| Age | $F_{OAE}$ | 0.783 | 2.425 | 14.36 | 2.943 | 2.104 | 1.163 | 1.068 | 10.291 | 8.482 | 0.908 | 1.64 | 1.333 |
| Age | $F_{EO}$ | 15.085 | 35.787 | 23.787 | 24.836 | 29.459 | 28.836 | 3.87 | 22.027 | 11.175 | 15.268 | 19.652 | 2.362 |
| Intersection | $F_{MEO}$ | 7.407 | 17.306 | 16.78 | 7.143 | 7.143 | 17.857 | 3.704 | 24.774 | 26.04 | 4.054 | 3.571 | 2.817 |
| Intersection | $F_{DP}$ | 47.301 | 51.383 | 47.73 | 47.13 | 50.191 | 48.999 | 47.301 | 50.62 | 50.534 | 47.215 | 48.236 | 48.322 |
| Intersection | $F_{OAE}$ | 4.545 | 6.281 | 6.404 | 4.082 | 2.273 | 6.818 | 2.273 | 8.961 | 9.494 | 3.061 | 2.041 | 1.105 |
| Intersection | $F_{EO}$ | 20.331 | 50.793 | 39.145 | 22.806 | 19.225 | 41.872 | 11.758 | 53.806 | 50.322 | 16.809 | 12.616 | 7.417 |
| - | ACC | 98.975 | 97.819 | 96.347 | 98.476 | 99.08 | 97.976 | 99.527 | 94.77 | 96.399 | 99.054 | 99.448 | 99.685 |
| - | AUC | 99.925 | 99.794 | 99.392 | 99.861 | 99.964 | 99.902 | 99.985 | 99.703 | 99.51 | 99.892 | 99.986 | 99.979 |
| - | AP | 99.94 | 99.854 | 99.386 | 99.904 | 99.97 | 99.916 | 99.988 | 99.756 | 99.316 | 99.925 | 99.989 | 99.981 |
| - | EER | 0.982 | 1.443 | 3.753 | 1.501 | 0.693 | 1.27 | 0.52 | 2.887 | 2.483 | 0.982 | 0.52 | 0.462 |

Table 23: Detailed fairness and utility evaluation results on StyleGAN.
Dataset: StyleGAN2. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 0.847 | 0.56 | 1.556 | 0.976 | 0.487 | 0.666 | 0.27 | 0.447 | 1.073 | 0.534 | 0.241 | 0.045 |
| Gender | $F_{DP}$ | 3.512 | 2.482 | 3.231 | 3.434 | 2.926 | 2.668 | 2.988 | 2.775 | 2.129 | 2.984 | 2.961 | 2.871 |
| Gender | $F_{OAE}$ | 0.077 | 0.341 | 0.624 | 0.246 | 0.342 | 0.636 | 0.092 | 0.117 | 0.698 | 0.31 | 0.092 | 0.022 |
| Gender | $F_{EO}$ | 1.538 | 0.594 | 1.7 | 1.338 | 0.686 | 1.192 | 0.317 | 0.482 | 1.113 | 0.56 | 0.263 | 0.047 |
| Race | $F_{MEO}$ | 1.037 | 5.147 | 6.385 | 1.057 | 1.401 | 6.072 | 1.244 | 15.519 | 16.197 | 2.565 | 0.926 | 0.517 |
| Race | $F_{DP}$ | 33.803 | 35.296 | 33.103 | 34.638 | 35.076 | 36.619 | 35.925 | 38.228 | 38.381 | 34.674 | 35.711 | 35.522 |
| Race | $F_{OAE}$ | 1.451 | 1.354 | 1.4 | 0.471 | 0.506 | 1.907 | 0.229 | 2.583 | 2.74 | 1.331 | 0.38 | 0.247 |
| Race | $F_{EO}$ | 2.251 | 7.675 | 13.968 | 2.428 | 2.762 | 7.489 | 2.377 | 17.827 | 19.998 | 5.173 | 2.369 | 1.48 |
| Age | $F_{MEO}$ | 2.766 | 2.561 | 8.543 | 3.408 | 3.328 | 2.493 | 2.486 | 9.408 | 9.634 | 6.514 | 2.532 | 0.607 |
| Age | $F_{DP}$ | 16.251 | 16.177 | 16.677 | 16.418 | 16.74 | 15.95 | 16.669 | 16.91 | 18.079 | 12.621 | 16.323 | 15.762 |
| Age | $F_{OAE}$ | 1.016 | 0.375 | 2.35 | 1.05 | 1.265 | 2.008 | 0.647 | 1.74 | 2.042 | 5.433 | 0.97 | 0.528 |
| Age | $F_{EO}$ | 5.353 | 2.926 | 13.051 | 5.249 | 4.883 | 6.547 | 3.779 | 12.811 | 10.052 | 11.092 | 3.689 | 1.966 |
| Intersection | $F_{MEO}$ | 2.436 | 5.448 | 9.55 | 2.127 | 2.384 | 7.475 | 1.468 | 18.286 | 20.753 | 3.132 | 1.511 | 0.695 |
| Intersection | $F_{DP}$ | 37.77 | 39.222 | 35.56 | 38.81 | 38.411 | 39.991 | 39.446 | 42.732 | 42.186 | 38.965 | 39.128 | 38.732 |
| Intersection | $F_{OAE}$ | 1.896 | 2.822 | 2.726 | 0.658 | 1.073 | 2.369 | 0.643 | 3.644 | 4.862 | 1.795 | 0.575 | 0.488 |
| Intersection | $F_{EO}$ | 9.7 | 18.646 | 26.948 | 7.826 | 7.696 | 17.235 | 5.947 | 35.527 | 41.168 | 13.195 | 5.737 | 3.074 |
| - | ACC | 97.46 | 98.044 | 95.299 | 98.472 | 98.799 | 98.331 | 99.311 | 94.745 | 96.23 | 97.207 | 99.32 | 99.479 |
| - | AUC | 99.738 | 99.698 | 98.85 | 99.794 | 99.816 | 99.741 | 99.877 | 99.205 | 99.209 | 99.713 | 99.883 | 99.968 |
| - | AP | 99.787 | 99.656 | 98.871 | 99.819 | 99.794 | 99.715 | 99.861 | 99.205 | 98.979 | 99.759 | 99.901 | 99.97 |
| - | EER | 2.161 | 1.066 | 5.234 | 1.542 | 1.17 | 1.309 | 0.704 | 3.906 | 3.019 | 2.374 | 0.699 | 0.535 |

Table 24: Detailed fairness and utility evaluation results on StyleGAN2.
Dataset: StyleGAN3. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 0.645 | 1.688 | 1.345 | 0.399 | 0.177 | 0.132 | 0.339 | 1.194 | 1.626 | 1.868 | 0.378 | 0.23 |
| Gender | $F_{DP}$ | 5.585 | 6.032 | 4.595 | 5.195 | 5.327 | 5.274 | 5.532 | 5.417 | 5.837 | 6.14 | 5.536 | 5.482 |
| Gender | $F_{OAE}$ | 0.374 | 0.558 | 1.26 | 0.235 | 0.113 | 0.031 | 0.174 | 0.062 | 0.221 | 0.886 | 0.173 | 0.086 |
| Gender | $F_{EO}$ | 0.774 | 1.792 | 1.843 | 0.56 | 0.254 | 0.219 | 0.382 | 1.305 | 1.842 | 1.952 | 0.421 | 0.264 |
| Race | $F_{MEO}$ | 1.701 | 8.361 | 11.792 | 0.605 | 0.498 | 1.546 | 0.893 | 19.275 | 17.174 | 2.079 | 0.708 | 3.61 |
| Race | $F_{DP}$ | 41.75 | 42.909 | 41.384 | 42.642 | 43.108 | 42.823 | 43.681 | 45.108 | 45.534 | 42.603 | 43.073 | 44.332 |
| Race | $F_{OAE}$ | 0.259 | 1.66 | 2.29 | 0.514 | 0.47 | 0.436 | 0.612 | 2.527 | 2.103 | 0.757 | 0.335 | 0.761 |
| Race | $F_{EO}$ | 4.543 | 13.614 | 18.916 | 1.177 | 1.092 | 2.955 | 1.47 | 21.59 | 24.212 | 4.237 | 2.164 | 4.885 |
| Age | $F_{MEO}$ | 1.403 | 2.138 | 10.825 | 1.727 | 0.743 | 0.459 | 2.1 | 11.432 | 14.27 | 3.777 | 1.792 | 0.892 |
| Age | $F_{DP}$ | 14.913 | 14.783 | 17.612 | 14.285 | 14.734 | 14.734 | 14.04 | 17.735 | 19.446 | 15.206 | 13.967 | 15.198 |
| Age | $F_{OAE}$ | 0.782 | 0.986 | 4.387 | 1.1 | 0.612 | 0.465 | 1.408 | 3.895 | 5.177 | 1.775 | 0.984 | 0.31 |
| Age | $F_{EO}$ | 3.378 | 4.108 | 14.744 | 3.685 | 1.714 | 1.141 | 3.498 | 15.191 | 16.05 | 5.888 | 3.075 | 1.073 |
| Intersection | $F_{MEO}$ | 2.439 | 10.814 | 14.81 | 1.096 | 1.429 | 2.381 | 1.429 | 24.377 | 22.722 | 3.681 | 2.439 | 4.138 |
| Intersection | $F_{DP}$ | 50.071 | 51.956 | 50.376 | 50.55 | 51.369 | 51.129 | 52.043 | 53.357 | 55.475 | 51.27 | 51.514 | 52.961 |
| Intersection | $F_{OAE}$ | 0.893 | 2.808 | 3.841 | 0.644 | 1.013 | 0.526 | 1.124 | 6.702 | 5.604 | 2.306 | 2.143 | 1.429 |
| Intersection | $F_{EO}$ | 11.913 | 30.289 | 40.686 | 3.519 | 4.723 | 6.395 | 5.967 | 44.341 | 52.555 | 14.502 | 7.765 | 12.542 |
| - | ACC | 98.696 | 98.009 | 95.771 | 98.645 | 99.364 | 99.548 | 99.374 | 94.703 | 96.12 | 97.444 | 99.199 | 99.672 |
| - | AUC | 99.86 | 99.613 | 99.263 | 99.863 | 99.923 | 99.906 | 99.941 | 99.04 | 98.621 | 99.749 | 99.929 | 99.996 |
| - | AP | 99.906 | 99.568 | 99.302 | 99.906 | 99.951 | 99.9 | 99.961 | 99.172 | 98.577 | 99.814 | 99.956 | 99.996 |
| - | EER | 1.373 | 1.733 | 4.142 | 1.351 | 0.675 | 0.45 | 0.72 | 4.66 | 5.088 | 2.139 | 0.653 | 0.36 |

Table 25: Detailed fairness and utility evaluation results on StyleGAN3.
Dataset: MSG-StyleGAN. Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced.

| Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | $F_{MEO}$ | 1.136 | 12.078 | 0.703 | 3.409 | 0.654 | 2.219 | 0.654 | 14.301 | 7.359 | 0.614 | 0.039 | 0 |
| Gender | $F_{DP}$ | 27.6 | 31.71 | 25.179 | 26.151 | 27.903 | 26.151 | 27.903 | 15.658 | 29.646 | 27.059 | 27.784 | 28.325 |
| Gender | $F_{OAE}$ | 0.725 | 3.146 | 3.146 | 2.174 | 0.422 | 1.33 | 0.422 | 8.063 | 1.321 | 0.66 | 0.541 | 0 |
| Gender | $F_{EO}$ | 1.136 | 13.323 | 0.703 | 3.409 | 0.654 | 2.872 | 0.654 | 15.978 | 7.359 | 0.668 | 0.039 | 0 |
| Race | $F_{MEO}$ | 0.709 | 12.5 | 18.762 | 9.091 | 0.515 | 17.473 | 0.515 | 18.182 | 13.217 | 2.577 | 12.5 | 0 |
| Race | $F_{DP}$ | 49.876 | 46.294 | 36.493 | 50.174 | 49.279 | 32.91 | 49.279 | 41.228 | 43.333 | 48.682 | 48.682 | 49.577 |
| Race | $F_{OAE}$ | 0.299 | 10.526 | 13.085 | 5.263 | 0.299 | 16.07 | 0.299 | 7.456 | 8.437 | 2.09 | 5.263 | 0 |
| Race | $F_{EO}$ | 1.291 | 22.112 | 27.06 | 9.417 | 1.008 | 25.246 | 1.008 | 50.251 | 14.169 | 7.622 | 12.924 | 0 |
| Age | $F_{MEO}$ | 2.857 | 37.594 | 5.714 | 2.894 | 0.526 | 16.667 | 0.526 | 33.333 | 12.381 | 3.846 | 15.614 | 0 |
| Age | $F_{DP}$ | 47.334 | 49.913 | 48.962 | 47.673 | 49.434 | 50.451 | 49.434 | 42.249 | 51.74 | 48.417 | 51.534 | 49.773 |
| Age | $F_{OAE}$ | 2.439 | 6.652 | 1.993 | 2.691 | 0.339 | 4.2 | 0.339 | 4.807 | 4.407 | 3.03 | 2.352 | 0 |
| Age | $F_{EO}$ | 3.439 | 51.965 | 7.906 | 4.007 | 1.019 | 19.632 | 1.019 | 37.676 | 17.663 | 9.151 | 27.929 | 0 |
| Intersection | $F_{MEO}$ | 1.493 | 20 | 41.892 | 11.111 | 0.667 | 25 | 0.667 | 50 | 50 | 2.667 | 20 | 0 |
| Intersection | $F_{DP}$ | 55.853 | 54.067 | 57.778 | 55.853 | 55.407 | 33.631 | 55.407 | 68.889 | 68.889 | 54.514 | 54.514 | 55.853 |
| Intersection | $F_{OAE}$ | 0.901 | 14.286 | 17.321 | 7.143 | 0.446 | 22.222 | 0.446 | 20 | 20 | 2.232 | 7.143 | 0 |
| Intersection | $F_{EO}$ | 3.237 | 54.364 | 57.775 | 15.84 | 2.144 | 39.23 | 2.144 | 123.31 | 59.65 | 11.79 | 23.97 | 0 |
| - | ACC | 99.733 | 95.467 | 95.467 | 99.2 | 99.733 | 98.667 | 99.733 | 86.933 | 96.267 | 98.133 | 98.933 | 100 |
| - | AUC | 99.997 | 98.943 | 99.834 | 99.994 | 100 | 99.928 | 100 | 94.53 | 96.669 | 99.908 | 100 | 100 |
| - | AP | 99.998 | 97.156 | 99.863 | 99.995 | 100 | 99.939 | 100 | 93.162 | 95.249 | 99.922 | 100 | 100 |
| - | EER | 0.581 | 4.651 | 2.326 | 0 | 0 | 1.163 | 0 | 11.628 | 6.395 | 1.163 | 0 | 0 |

Table 26: Detailed fairness and utility evaluation results on MSG-StyleGAN.
Model types (three columns each, left to right): Naive, Frequency, Spatial, Fairness-enhanced
Columns: Dataset, Attribute, Metric, Xception [61], EfficientB4 [62], ViT-B/16 [63], F3Net [64], SPSL [65], SRM [66], UCF [16], UnivFD [67], CORE [68], DAW-FDD [20], DAG-FDD [20], PG-FDD [21]
ProGAN Gender $F_{MEO}$ 0.381 1.296 1.134 0.428 0.315 0.403 0.243 0.834 2.333 0.236 0.242 0.138
$F_{DP}$ 16.706 17.133 15.201 16.709 16.918 16.671 17.043 15.652 16.838 16.708 17.082 16.882
$F_{OAE}$ 0.341 0.006 1.793 0.396 0.305 0.187 0.131 1.442 0.227 0.236 0.081 0.19
$F_{EO}$ 0.504 1.501 1.137 0.610 0.613 0.437 0.257 0.947 2.389 0.304 0.271 0.167
Race $F_{MEO}$ 3.598 5.357 9.424 2.721 3.565 4.177 2.743 18.043 20.491 3.027 3.149 0.822
$F_{DP}$ 35.504 30.506 25.824 35.179 36.143 35.551 35.743 22.053 18.852 35.97 35.661 34.624
$F_{OAE}$ 1.285 5.514 10.036 1.232 0.609 1.279 0.693 15.502 17.141 0.48 0.926 0.844
$F_{EO}$ 5.759 9.542 14.448 4.235 4.912 5.619 3.788 22.113 25.053 4.519 4.381 1.968
Age $F_{MEO}$ 1.374 1.953 8.853 0.898 0.798 0.932 0.656 6.036 5.397 1.262 0.656 1.244
$F_{DP}$ 21.22 22.383 19.714 21.441 21.587 21.342 21.717 19.868 21.597 21.319 21.817 21.601
$F_{OAE}$ 0.897 0.912 5.411 0.804 0.583 0.726 0.503 4.49 2.83 1.067 0.367 0.702
$F_{EO}$ 2.875 3.954 11.106 2.453 2.53 2.744 1.763 10.015 6.825 3.491 1.646 1.582
Intersection $F_{MEO}$ 4.284 6.829 10.793 6.557 6.581 8.513 6.604 20.936 24.845 6.406 6.581 0.935
$F_{DP}$ 50.437 47.389 40.919 50.462 51.762 51.533 52.347 38.606 36.209 51.858 51.988 50.523
$F_{OAE}$ 1.882 5.836 11.532 2.432 1.162 1.529 0.818 16.738 19.088 0.956 0.993 0.961
$F_{EO}$ 13.14 22.712 31.013 13.286 13.577 16.796 11.759 42.997 52.525 13.479 12.476 3.729
- ACC 99.357 98.286 96.458 99.344 99.558 99.384 99.639 95.045 96.418 99.243 99.688 99.68
AUC 99.968 99.899 99.84 99.938 99.961 99.948 99.977 99.895 99.105 99.954 99.984 99.996
AP 99.976 99.928 99.861 99.959 99.974 99.959 99.984 99.927 98.838 99.966 99.988 99.997
EER 0.535 0.916 1.838 0.547 0.44 0.595 0.363 0.69 3.094 0.696 0.345 0.321
Table 27: Detailed fairness and utility evaluation results on ProGAN.
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
STGAN Gender $F_{MEO}$ 17.404 18.772 4.36 13.333 12.912 5.087 6.737 12.632 7.965 12.596 14.737 5.333
$F_{DP}$ 16.581 15 5.161 12.984 13.419 7.823 11.774 0.097 14.984 12.935 15.645 11.194
$F_{OAE}$ 7.968 10.419 2.452 8.532 7.387 4.661 2.452 12.774 0.048 5.903 6.323 2.581
$F_{EO}$ 17.404 21.708 5.238 17.171 15.412 9.087 7.812 19.085 15.123 13.934 15.812 5.333
Race $F_{MEO}$ 13.299 20 4.167 11.765 17.647 4.412 17.647 50 40 26.961 10 5.882
$F_{DP}$ 25 32.197 18.561 20.613 19.048 19.697 22.811 13.258 29.337 13.62 18.215 20.613
$F_{OAE}$ 8.97 4.5 2.464 8 12 3.297 12 36.742 10.023 19.833 7.955 4
$F_{EO}$ 18.036 47.951 6.249 27.305 24.367 17.005 33.472 83.525 53.832 35.574 17.67 8.072
Age $F_{MEO}$ 18.277 28.125 22.581 20 23.333 8.696 20 22.5 14.146 19.916 20.784 10
$F_{DP}$ 10.656 11.688 10.343 14.52 19.438 10.82 16.159 10.134 8.765 13.574 14.844 12.881
$F_{OAE}$ 10.99 20 11.475 13.115 11.475 5.455 11.475 14.637 8.67 12.404 9.115 4.918
$F_{EO}$ 26.439 43.956 36.21 42.739 41.293 15.487 29.446 43.298 35.301 48.478 32.922 13.125
Intersection $F_{MEO}$ 28.571 30.612 16.667 16.327 23.077 7.812 25 66.667 66.667 38.462 24.49 7.692
$F_{DP}$ 40.833 45 39.167 35.177 32.78 35 36.947 33.333 47.677 23.82 33.038 35
$F_{OAE}$ 16.667 22.222 7.143 10.619 15.789 7.08 16.667 48.333 20.833 26.316 16.667 5.263
$F_{EO}$ 73.153 131.743 58.478 75.809 70.579 43.757 85.001 236.087 150.768 101.162 90.191 22.28
- ACC 93.521 88.451 94.93 95.775 95.775 97.465 94.93 69.577 93.521 90.423 93.239 98.873
AUC 99.573 96.465 98.534 99.541 99.547 97.807 99.538 85.921 97.335 99.194 99.522 99.908
AP 99.639 95.872 98.139 99.59 99.607 97.132 99.579 82.016 96.354 99.242 99.554 99.922
EER 4.217 7.831 3.614 3.614 3.614 3.614 3.614 19.277 6.024 3.614 3.614 3.614
Table 28: Detailed fairness and utility evaluation results on STGAN.
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
VQGAN Gender $F_{MEO}$ 0.359 1.689 0.682 0.732 0.368 0.47 0.241 1.789 1.021 0.642 0.338 0.207
$F_{DP}$ 12.221 13.117 11.324 12.105 12.491 12.466 12.581 9.224 12.203 12.216 12.609 12.755
$F_{OAE}$ 0.231 0.067 1.264 0.49 0.377 0.454 0.187 0.12 0.625 0.482 0.243 0.069
$F_{EO}$ 0.371 2.198 0.722 0.829 0.719 0.876 0.367 2.226 1.274 0.853 0.513 0.386
Race $F_{MEO}$ 1.267 8.893 10.064 2.257 3.158 2.321 3.273 20.147 20.549 4.11 3.183 1.429
$F_{DP}$ 59.881 60.677 57.134 60.756 60.89 60.225 61.269 54.127 62.9 60.342 61.099 61.113
$F_{OAE}$ 0.692 1.985 3.926 0.611 0.47 0.855 0.58 9.087 2.245 0.726 0.518 0.283
$F_{EO}$ 3.139 13.952 15.64 3.751 4.581 4.174 4.947 34.162 25.23 6.323 4.388 2.143
Age $F_{MEO}$ 1.104 1.105 9.03 1.813 1.533 1.533 0.648 5.925 8.977 1.61 0.703 0.897
$F_{DP}$ 29.715 29.264 30.54 29.734 30.239 30.338 30.415 22.037 31.068 30.059 30.343 30.537
$F_{OAE}$ 0.956 0.953 2.392 0.984 0.785 0.866 0.339 4.197 3.12 0.992 0.444 0.461
$F_{EO}$ 2.363 4.055 11.65 3.145 2.892 3.001 1.344 18.036 10.529 3.499 1.421 1.653
Intersection $F_{MEO}$ 3.846 13.893 12.515 3.44 3.504 3.112 3.671 23.965 25.638 4.721 3.525 2
$F_{DP}$ 67.678 69.217 64.056 68.414 68.785 67.971 69.072 60.076 71.221 68.725 69.149 69.359
$F_{OAE}$ 1.703 3.567 5.313 0.889 1.029 1.408 0.787 10.854 3.869 2.36 1.186 0.976
$F_{EO}$ 9.427 27.715 31.838 9.934 10.748 11.46 11.34 66.065 46.268 16.097 10.787 5.751
- ACC 99.092 97.936 96.313 99.102 99.344 99.387 99.543 91.217 96.248 99.135 99.538 99.758
AUC 99.909 99.746 99.699 99.883 99.878 99.879 99.912 96.257 98.872 99.912 99.938 99.99
AP 99.926 99.755 99.716 99.901 99.871 99.855 99.895 96.027 98.822 99.932 99.952 99.991
EER 0.835 1.565 2.554 0.706 0.683 0.588 0.447 9.508 4.46 0.812 0.447 0.306
Table 29: Detailed fairness and utility evaluation results on VQGAN.
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
Commercial Tools Gender $F_{MEO}$ 7.689 8.221 4.167 4.432 2.879 0.435 3.258 8.864 2.708 2.348 4.432 2.083
$F_{DP}$ 18.977 22.798 18.901 18.527 16.616 16.540 18.714 20.701 16.540 16.990 17.440 17.177
$F_{OAE}$ 5.961 1.613 3.337 3.337 2.326 1.350 2.250 5.137 3.524 2.700 4.424 2.887
$F_{EO}$ 10.397 14.471 5.714 4.867 3.459 0.435 4.950 10.411 4.791 2.929 7.140 2.664
Race $F_{MEO}$ 33.333 14.286 33.333 28.571 33.333 28.571 28.571 28.571 33.333 33.333 33.333 33.333
$F_{DP}$ 69.398 61.706 71.572 69.398 65.050 73.746 69.398 65.050 73.746 73.746 71.572 74.089
$F_{OAE}$ 21.053 7.018 10.526 10.526 15.789 4.348 8.696 13.043 10.526 15.789 15.789 5.263
$F_{EO}$ 77.098 29.869 62.147 52.682 73.558 36.044 48.191 50.286 63.706 70.345 70.475 57.018
Age $F_{MEO}$ 28.571 28.571 14.286 28.571 6.711 12.500 14.286 14.286 14.286 14.286 14.286 5.833
$F_{DP}$ 49.346 58.824 52.778 50.817 47.876 52.288 51.797 49.346 52.288 51.307 50.327 52.288
$F_{OAE}$ 9.225 10.205 8.170 8.170 6.566 11.111 8.170 6.566 9.641 7.680 7.680 8.660
$F_{EO}$ 37.679 36.583 21.513 43.361 20.392 13.639 27.762 28.094 22.990 22.156 22.837 8.656
Intersection $F_{MEO}$ 50.000 25.000 50.000 33.333 50.000 33.333 33.333 33.333 50.000 50.000 50.000 50.000
$F_{DP}$ 65.714 62.637 68.889 68.000 68.000 72.000 68.000 65.714 72.000 72.000 72.000 74.444
$F_{OAE}$ 22.222 8.547 11.111 11.111 19.048 5.128 9.524 20.000 11.111 16.667 16.667 7.692
$F_{EO}$ 131.495 67.938 105.685 81.793 129.065 59.940 76.252 91.701 105.535 114.665 126.561 109.221
- ACC 93.976 95.582 95.582 95.582 92.771 97.590 95.984 92.369 96.787 95.181 95.181 96.386
AUC 95.778 99.541 99.005 96.349 94.798 95.716 95.681 96.808 97.371 97.141 94.812 93.365
AP 96.193 99.751 99.401 96.966 95.607 96.184 93.761 98.000 98.153 97.779 94.066 90.493
EER 7.692 3.297 5.495 6.593 8.791 6.593 6.593 9.890 6.593 6.593 6.593 7.692
Table 30: Detailed fairness and utility evaluation results on Commercial Tools (DALLE2, IF & Midjourney).
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
DCFace Gender $F_{MEO}$ 0.525 0.87 1.201 0.637 0.196 0.338 0.052 0.596 1.77 0.368 0.066 0.04
$F_{DP}$ 5.465 4.56 5.28 5.534 5.342 5.371 5.274 4.013 4.012 5.351 5.265 5.213
$F_{OAE}$ 0.191 0.412 0.311 0.214 0.062 0.163 0.015 0.191 1.155 0.161 0.048 0.018
$F_{EO}$ 0.549 0.873 1.393 0.704 0.231 0.343 0.066 0.944 1.86 0.415 0.085 0.076
Race $F_{MEO}$ 0.667 5.55 8.151 0.608 0.737 0.591 0.663 16.794 18.582 0.78 0.669 0.938
$F_{DP}$ 18.219 18.877 20.375 18.181 18.285 18.201 18.441 26.177 25.501 18.289 18.292 18.522
$F_{OAE}$ 0.535 2.088 3.419 0.384 0.294 0.359 0.431 5.528 6.893 0.322 0.304 0.392
$F_{EO}$ 2.333 10.815 14.05 1.889 1.94 1.649 2.154 29.148 23.112 1.838 1.661 1.454
Age $F_{MEO}$ 1.448 2.989 9.273 1.071 0.567 1.055 0.425 6.914 7.59 1.253 0.706 0.594
$F_{DP}$ 12.708 13.621 9.357 12.926 13.119 12.754 13.08 10.929 8.445 12.763 13.314 12.741
$F_{OAE}$ 0.918 1.218 4.94 0.772 0.501 0.831 0.358 6.765 4.839 0.973 0.4 0.388
$F_{EO}$ 2.782 6.577 12.53 2.419 1.552 2.202 1.108 15.928 8.071 2.468 1.232 0.924
Intersection $F_{MEO}$ 1.327 7.377 9.272 1.463 0.892 0.923 0.866 19.337 22.619 0.886 0.868 1.29
$F_{DP}$ 21.136 20.833 22.531 21.04 20.964 20.906 21.006 27.504 28.454 21.116 20.984 21.006
$F_{OAE}$ 0.764 3.089 4.017 0.649 0.362 0.518 0.498 6.588 8.981 0.548 0.391 0.568
$F_{EO}$ 5.666 22.247 27.649 6.712 4.452 4.176 4.367 58.647 45.831 5.043 3.709 2.873
- ACC 99.361 96.935 96.038 99.314 99.542 99.395 99.627 92.834 96.443 99.329 99.654 99.727
AUC 99.961 99.513 99.718 99.938 99.934 99.956 99.965 97.415 99.129 99.956 99.965 99.994
AP 99.972 99.602 99.776 99.955 99.947 99.963 99.97 97.347 98.913 99.969 99.977 99.995
EER 0.422 3.07 2.649 0.414 0.414 0.612 0.363 7.661 3.26 0.515 0.368 0.322
Table 31: Detailed fairness and utility evaluation results on DCFace.
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
Latent Diffusion Gender $F_{MEO}$ 0.711 1.315 6.822 0.587 0.269 0.404 0.343 0.523 1.341 0.709 0.343 0.005
$F_{DP}$ 6.704 7.776 1.172 6.844 7.154 7.227 7.274 6.063 7.26 6.444 7.367 7.15
$F_{OAE}$ 0.339 0.192 3.448 0.376 0.232 0.315 0.206 0.052 0.059 0.069 0.113 0.02
$F_{EO}$ 0.723 1.748 8.477 0.706 0.467 0.658 0.411 0.993 1.39 1.186 0.464 0.005
Race $F_{MEO}$ 1.377 7.837 10.67 1.602 0.319 0.559 1.116 19.547 20.291 0.763 0.633 0.503
$F_{DP}$ 39.89 41.058 30.225 39.89 40.64 40.819 40.462 39.759 42.918 40.116 40.95 40.593
$F_{OAE}$ 1.202 1.262 9.842 0.921 0.206 0.387 0.658 4.85 2.278 0.691 0.299 0.31
$F_{EO}$ 2.443 10.604 26.477 3.081 0.884 1.541 2.659 30.757 21.909 2.097 1.363 0.765
Age $F_{MEO}$ 2.771 2.325 12.515 1.798 0.9 0.803 0.762 3.544 4.117 1.896 0.571 1.604
$F_{DP}$ 20.503 20.83 20.755 20.183 20.119 19.742 19.955 15.919 20.598 20.183 20.062 20.823
$F_{OAE}$ 0.913 0.495 5.504 0.508 0.434 0.505 0.275 2.119 0.437 1.088 0.319 0.518
$F_{EO}$ 4.881 4.303 35.584 3.991 1.97 1.782 1.411 11.437 6.184 4.923 1.425 2.165
Intersection $F_{MEO}$ 3.571 10.805 13.52 3.846 1.786 1.786 1.786 22.411 24.881 3.226 1.786 0.714
$F_{DP}$ 52.751 53.892 39.806 52.751 53.771 54.281 53.441 50.799 55.935 52.87 54.461 53.621
$F_{OAE}$ 3.061 2.274 13.761 2.643 1.322 1.322 1.531 5.062 2.708 1.442 0.541 0.51
$F_{EO}$ 10.619 23.761 61.046 11.037 7.073 5.892 8.286 57.444 45.514 11.596 5.021 1.768
- ACC 99.066 98.528 88.706 99.179 99.505 99.674 99.646 92.669 96.519 98.981 99.689 99.887
AUC 99.921 99.948 96.795 99.908 99.942 99.968 99.972 97.153 98.926 99.916 99.971 99.999
AP 99.945 99.961 96.469 99.931 99.965 99.976 99.983 96.901 98.668 99.94 99.981 99.999
EER 0.906 0.531 9.031 0.688 0.5 0.406 0.375 8 3.719 1 0.469 0.156
Table 32: Detailed fairness and utility evaluation results on Latent Diffusion.
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
Palette Gender $F_{MEO}$ 1.164 0.544 1.2 1.164 1.121 2.159 1.611 9.265 0.757 1.348 0.503 0.727
$F_{DP}$ 13.196 13.889 12.155 13.052 13.571 13.763 13.705 2.952 13.466 12.548 13.542 13.928
$F_{OAE}$ 1.164 0.278 2.166 1.02 0.848 1.308 1.077 9.412 0.95 0.803 0.55 0.147
$F_{EO}$ 1.999 0.866 1.844 1.45 1.814 3.043 2.412 10.763 1.438 1.536 0.884 0.791
Race $F_{MEO}$ 3.659 5.947 11.668 3.333 6.098 7.317 6.098 20.528 20.242 5.108 2.83 7.317
$F_{DP}$ 5.979 6.385 4.834 7.61 5.135 4.144 5.922 16.802 8.742 5.776 7.261 5.922
$F_{OAE}$ 1.965 4.123 8.436 2.97 2.963 2.046 2.329 6.372 13.877 2.402 2.062 3.319
$F_{EO}$ 8.157 11.825 18.769 7.416 9.477 14.029 10.363 43.621 26.601 12.368 6.161 10.827
Age $F_{MEO}$ 4.688 3.002 7.966 3.756 4.042 3.765 4.425 12.333 8.923 4.995 4.042 1.948
$F_{DP}$ 19.14 19.394 18.86 20.025 18.688 19.789 20.438 14.893 21.909 20.674 20.674 17.134
$F_{OAE}$ 3.534 1.426 4.149 2.78 3.775 3.392 2.765 10.627 4.861 2.715 2.271 0.865
$F_{EO}$ 11.998 6.893 14.031 9.808 11.945 12.703 12.283 38.971 10.877 11.605 9.98 3.288
Intersection $F_{MEO}$ 9.375 8.995 20.066 5.556 12.5 12.5 12.5 58 22.231 12.5 4.412 9.375
$F_{DP}$ 37.067 39.108 27.455 36.047 37.067 37.352 38.372 23.106 27.997 36.047 39.393 35.026
$F_{OAE}$ 5.769 5.454 18.902 4.808 4.808 4.808 4.808 22.672 17.864 4.137 3.54 4.082
$F_{EO}$ 24.367 23.905 43.338 18.623 24.071 33.455 25.185 150.941 53.279 30.359 16.799 27.109
- ACC 98.547 97.465 94.189 98.671 98.578 98.423 98.702 73.447 94.405 97.682 98.887 99.073
AUC 99.736 99.581 99.501 99.756 99.644 99.387 99.856 80.642 97.922 99.704 99.781 99.923
AP 99.423 98.911 99.063 99.497 99.079 98.07 99.725 67.995 95.558 99.432 99.657 99.867
EER 1.525 1.672 2.951 1.279 1.426 1.574 1.328 26.365 6.05 2.361 1.279 1.082
Table 33: Detailed fairness and utility evaluation results on Palette.
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
SD1.5 Gender $F_{MEO}$ 0.509 3.139 0.317 1.139 1.186 0.248 1.674 2.635 0.891 0.457 1.275 0.345
$F_{DP}$ 22.081 19.548 20.978 22.834 22.776 22.724 22.857 11.878 20.185 20.993 23.084 22.934
$F_{OAE}$ 0.495 1.293 1.576 0.178 0.153 0.021 0.332 4.404 2.235 1.599 0.07 0.119
$F_{EO}$ 0.613 3.185 0.554 1.705 1.56 0.359 1.886 3.738 1.065 0.625 1.661 0.399
Race $F_{MEO}$ 1.968 7.1 8.289 2.12 1.5 1.342 1.959 19.503 18.674 2.837 1.469 1.323
$F_{DP}$ 15.903 16.759 15.365 15.889 15.045 14.817 15.624 11.712 14.571 14.654 15.689 14.94
$F_{OAE}$ 0.943 2.918 4.108 1.294 0.879 0.297 1.454 10.28 9.562 1.645 1.347 0.901
$F_{EO}$ 4.614 15.815 13.016 4.227 2.198 2.227 3.486 40.819 27.388 6.944 3.342 3.124
Age $F_{MEO}$ 2.87 3.457 5.482 1.658 3.43 1.295 3.244 5.749 11.164 4.406 2.534 1.061
$F_{DP}$ 31.026 28.27 31.043 30.768 32.076 31.734 31.627 16.203 30.92 31.262 31.78 32.059
$F_{OAE}$ 2.054 2.83 1.832 1.244 2.275 0.828 2.1 11.13 2.965 2.806 1.818 0.942
$F_{EO}$ 6.24 9.127 9.167 3.913 6.148 2.276 5.276 12.176 15.841 7.182 5.215 2.787
Intersection $F_{MEO}$ 3.333 11.68 11.018 3.333 6.667 2.439 6.206 24.497 23.778 3.283 6.206 1.695
$F_{DP}$ 34.936 32.557 32.823 35.564 34.066 33.963 34.985 24.536 30.941 32.333 35.227 34.35
$F_{OAE}$ 1.928 5.084 4.661 1.915 3.382 1.667 2.27 14.398 11.932 3.428 2.27 1.208
$F_{EO}$ 10.679 35.812 28.777 12.75 18.354 10.377 15.07 88.474 69.692 15.9 13.668 8.346
- ACC 97.272 95.847 95.862 97.833 97.848 99.045 98.151 73.219 94.983 95.696 98.424 99.47
AUC 99.792 98.953 99.499 99.803 99.826 99.766 99.877 86.563 97.922 99.63 99.893 99.963
AP 99.832 98.661 99.538 99.828 99.862 99.716 99.914 85.449 97.861 99.675 99.928 99.969
EER 1.887 4.073 2.947 1.755 1.523 0.993 1.192 21.159 6.424 2.682 0.861 0.53
Table 34: Detailed fairness and utility evaluation results on SD v1.5.
Model Type: Naive, Frequency, Spatial, Fairness-enhanced
Dataset Attribute Metric Xception [61] EfficientB4 [62] ViT-B/16 [63] F3Net [64] SPSL [65] SRM [66] UCF [16] UnivFD [67] CORE [68] DAW-FDD [20] DAG-FDD [20] PG-FDD [21]
SD Inpainting Gender $F_{MEO}$ 2.241 6.934 2.814 2.686 1.636 2.432 1.652 4.449 1.288 2.495 1.455 0.849
$F_{DP}$ 20.739 14.685 21.614 23.834 21.747 23.587 23.36 12.632 19.574 19.711 22.764 22.798
$F_{OAE}$ 2.701 7.184 0.172 1.041 2.434 1.154 0.901 6.926 1.511 3.887 1.367 0.746
$F_{EO}$ 2.704 6.971 3.64 3.722 3.096 2.449 2.302 5.737 2.278 2.932 1.837 1.048
Race $F_{MEO}$ 2.353 8.159 13.709 4.767 2.884 3.957 4.424 15.135 23.357 4.628 4.424 1.599
$F_{DP}$ 11.33 9.279 10.32 11.773 10.267 12.654 11.567 6.414 10.94 12.332 13.285 12.717
$F_{OAE}$ 2.569 6.514 5.566 1.598 2.857 1.908 2.863 7.86 9.299 2.983 2.904 1.378
$F_{EO}$ 6.652 21.735 26.462 7.096 7.51 8.601 8.099 34.693 34.503 9.538 11.207 3.61
Age $F_{MEO}$ 6.106 10.945 8.907 8.131 6.14 6.319 6.518 5.833 5.738 6.494 6.329 1.77
$F_{DP}$ 34.829 24.512 35.588 35.072 36.422 36.373 35.589 22.872 33.578 35.295 35.397 36.172
$F_{OAE}$ 5.83 11.618 4.664 3.031 4.865 2.925 3.538 10.413 2.475 6.172 3.678 1.276
$F_{EO}$ 11.823 19.295 20.31 11.192 12.313 8.877 10.26 16.169 10.115 11.654 10.325 4.012
Intersection $F_{MEO}$ 6.237 19.863 19.037 8.725 5.369 6.711 7.383 19.06 31.294 12.213 7.383 3.693
$F_{DP}$ 27.04 21.323 28.54 30.477 27.278 30.434 29.669 17.974 28.629 25.716 30.228 30.269
$F_{OAE}$ 4.884 13.648 6.769 4.329 5.395 3.493 4.397 14.884 12.987 9.387 6.247 3.194
$F_{EO}$ 18.736 61.309 52.861 25.231 20.741 27.99 21.834 69.995 61.174 28.951 22.144 11.665
- ACC 95.133 86.754 94.333 96.517 95.475 97.445 96.86 78.49 94.105 92.849 96.846 98.715
AUC 99.552 97.281 98.31 99.525 99.547 99.659 99.687 89.51 97.403 99.386 99.707 99.912
AP 99.679 97.138 98.529 99.652 99.677 99.727 99.766 91.313 97.767 99.564 99.79 99.939
EER 3.434 7.631 6.729 3.226 3.33 2.463 2.428 17.933 7.527 3.954 2.393 1.283
Table 35: Detailed fairness and utility evaluation results on SD Inpainting.

B.5 Details of Post-Processing

In Section 4, we apply six post-processing methods to evaluate the detectors' robustness. Fig. B.1 visualizes an image after each post-processing method is applied. We describe each method as follows:

JPEG Compression: Image compression introduces compression artifacts and reduces image quality, simulating real-world scenarios in which images are of lower quality. For the results in Fig. 6, we apply JPEG compression with quality 60 to each image in the test set.

Gaussian Blur: This post-processing reduces image detail and noise by smoothing the image, replacing each pixel with a Gaussian-weighted average of its neighbors. For the results in Fig. 6, we apply Gaussian blur with kernel size 7 to each image in the test set.

Hue Saturation Value: This post-processing alters the hue, saturation, and value of the image within specified limits, simulating variations in color and lighting conditions. Adjusting the hue changes the overall color tone, saturation controls the intensity of colors, and value adjusts the brightness. For the results in Fig. 6, we adjust hue, saturation, and value with shift limits of 30.

Random Brightness and Contrast: This post-processing randomly adjusts the brightness and contrast of the image within specified limits, changing its illumination and contrast levels. It evaluates the detector's robustness to different illumination conditions. For the results in Fig. 6, we adjust brightness and contrast with shift limits of 0.2.

Random Crop: This post-processing resizes the image to a specified size and then randomly crops a portion of it to the target dimensions. It is used to evaluate the detector's robustness to variations in the spatial content of the image. For the results in Fig. 6, we randomly crop the image to a target dimension of $244 \times 244$.

Rotation: This post-processing rotates the image within a specified angle limit. It is used to evaluate the detector's robustness to changes in the orientation of objects within the image. For the results in Fig. 6, we randomly rotate the image by an angle between -45 and 45 degrees.
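To make three of the operations above concrete, the following is a minimal pure-Python sketch, not the implementation used for the experiments. It operates on a grayscale image stored as a list of rows. The function names, the default sigma (the heuristic OpenCV documents for deriving sigma from the kernel size), the border handling, and the form of the brightness/contrast update are all assumptions made for illustration; only the kernel size 7 and the shift limit 0.2 come from the text.

```python
import math
import random


def gaussian_kernel_1d(size=7, sigma=None):
    """Return a normalized 1-D Gaussian kernel of odd length `size`."""
    if sigma is None:
        # Assumption: OpenCV-style default sigma derived from the kernel size.
        sigma = 0.3 * ((size - 1) * 0.5 - 1) + 0.8
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_blur(image, size=7):
    """Separable Gaussian blur; border pixels are replicated (an assumption)."""
    k = gaussian_kernel_1d(size)
    half = size // 2
    h, w = len(image), len(image[0])
    # Horizontal pass: weighted average along each row.
    tmp = [[sum(k[j + half] * row[min(max(x + j, 0), w - 1)]
                for j in range(-half, half + 1))
            for x in range(w)]
           for row in image]
    # Vertical pass: weighted average along each column.
    return [[sum(k[j + half] * tmp[min(max(y + j, 0), h - 1)][x]
                 for j in range(-half, half + 1))
             for x in range(w)]
            for y in range(h)]


def random_brightness_contrast(image, limit=0.2, rng=random):
    """Apply out = alpha * p + beta with random gain and offset, clipped to [0, 255]."""
    alpha = 1.0 + rng.uniform(-limit, limit)    # contrast gain
    beta = rng.uniform(-limit, limit) * 255.0   # brightness offset
    return [[min(255.0, max(0.0, alpha * p + beta)) for p in row]
            for row in image]


def random_crop(image, size, rng=random):
    """Crop a random size x size window; the image must be at least that large."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in image[top:top + size]]
```

Passing a seeded `random.Random` instance as `rng` makes the random operations reproducible, which helps when the same perturbed test set must be shown to every detector.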

Figure B.1: Visualization of the image after different post-processing.
Figure B.2: Robustness analysis in terms of utility and fairness under varying degrees of JPEG compression.
Figure B.3: Robustness analysis in terms of utility and fairness under varying kernel sizes of Gaussian Blur.
Figure B.4: Robustness analysis in terms of utility and fairness under varying degrees of Hue Saturation Value.
Refer to caption
Figure B.5: Robustness analysis in terms of utility and fairness under varying degrees of Rotations.
Refer to caption
Figure B.6: Robustness analysis in terms of utility and fairness under varying degrees of Brightness Contrast.

B.6 Additional Fairness Robustness Evaluation Results

Figs. B.2 to B.6 show the detectors' robustness in more detail as a function of the degree of post-processing. Overall, ViT-B/16 [63] and UnivFD [67] are more robust to the various post-processing methods than the other detectors. Fairness-enhanced detectors are not robust to post-processing; improving this is a direction for future studies. Figure B.2 presents a detailed robustness analysis in terms of utility and fairness under varying degrees of JPEG compression. The utility of all detectors decreases as image quality is reduced. Among the detectors, UnivFD [67] exhibits the highest utility robustness, while ViT-B/16 [63] demonstrates the strongest fairness robustness. Under Gaussian blur, ViT-B/16 is the most robust detector in terms of utility, whereas EfficientB4 [62] is the most robust in terms of fairness. Under Hue Saturation Value adjustments, DAW-FDD [20] shows the strongest utility robustness, while UnivFD excels in fairness robustness. ViT-B/16 is the most robust in both utility and fairness under rotations. For brightness and contrast variations, DAG-FDD [20] is the most robust detector in terms of utility, while UnivFD is again the most robust in terms of fairness.

Figure B.7: The utility-fairness trade-off of Fairness-enhanced methods.

B.7 Additional Fairness Generalization Evaluation Results

We conduct additional generalization experiments by using models trained on FF++ [2] to evaluate their generalization performance on our AI-Face test set. For these experiments, we utilize the trained weights and intra-domain performance metrics provided by [16]. Consequently, only the detectors with pre-trained weights available from [16] are evaluated on our AI-Face test set. Results are shown in Table 36. We report the detailed performance on generation category subsets (i.e., Deepfake Videos, GANs, and DMs) and the overall performance on the whole test set. We observe that detectors exhibit significant performance degradation, approaching coin-toss performance, when trained on FF++ and tested on our AI-Face test set. This suggests that training detectors solely on one deepfake video dataset is not sufficient for detecting face images generated by current, more advanced generation models. It also highlights the significance of our AI-Face dataset, which is extensive, diverse, and comprehensive in generation methods, for developing and evaluating AI face detectors. The lowest performance is observed on GANs, likely due to the higher variety of generation methods within this category. Conversely, performance on the Deepfake Videos subset is relatively better. This could be because, despite being different datasets, the deepfake videos may share similar generation methods, resulting in less variation in the artifacts present in the generated images.

All columns other than Intra-Domain (FF++) are Cross-Domain (Ours w/o FF++): the test subsets Deepfake Videos (3), GANs (10), and DMs (8), followed by the whole test set.

| Type | Detector | Intra-Domain (FF++) AUC | Deepfake Videos (3) $F_{EO}$ | Deepfake Videos (3) AUC | GANs (10) $F_{EO}$ | GANs (10) AUC | DMs (8) $F_{EO}$ | DMs (8) AUC | Whole Test Set $F_{EO}$ | Whole Test Set AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Naive | Xception [61] | 96.370 | 104.961 | 77.766 | 139.963 | 58.228 | 110.977 | 78.622 | 101.194 | 72.649 |
| Naive | EfficientB4 [62] | 95.670 | 110.626 | 76.612 | 148.656 | 44.501 | 88.420 | 73.426 | 94.609 | 65.323 |
| Frequency | F3Net [64] | 96.350 | 74.828 | 74.328 | 93.278 | 39.127 | 89.927 | 75.480 | 68.299 | 65.149 |
| Frequency | SPSL [65] | 96.100 | 97.558 | 77.766 | 141.029 | 40.100 | 91.837 | 58.919 | 123.534 | 55.483 |
| Frequency | SRM [66] | 95.760 | 60.855 | 74.900 | 89.903 | 57.572 | 73.209 | 77.954 | 57.775 | 72.474 |
| Spatial | UCF [16] | 97.050 | 102.798 | 77.650 | 122.485 | 40.477 | 95.657 | 77.568 | 79.479 | 67.708 |
| Spatial | CORE [68] | 96.380 | 69.717 | 76.506 | 95.727 | 45.549 | 79.161 | 82.112 | 72.424 | 70.662 |
Table 36: Fairness and utility cross-domain evaluation. All detectors are trained on FF++ (model weights and AUC on FF++ test set are from [25]) and evaluated on our Demographically Annotated AI-Face. The best-performing method is highlighted in red.
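To make the per-subset reporting in the cross-domain evaluation concrete, the sketch below computes AUC for each generation category and for the whole test set from detector scores. It assumes, as one plausible protocol not confirmed by the text, that each subset pairs the fake images of one category with all real test images; the function names, the `(category, score)` record layout, and the toy scores are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the probability that a fake (label 1) scores above a real (label 0).

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_category(records, real_category="real"):
    """Compute AUC per fake category and on the whole set.

    records: iterable of (category, score), where score is the detector's
    predicted probability of 'fake'. Assumption: every subset reuses all
    real images as negatives.
    """
    real_scores = []
    fake_scores = defaultdict(list)
    for category, score in records:
        if category == real_category:
            real_scores.append(score)
        else:
            fake_scores[category].append(score)

    results = {}
    for category, scores in fake_scores.items():
        labels = [1] * len(scores) + [0] * len(real_scores)
        results[category] = auc(labels, scores + real_scores)

    all_fake = [s for scores in fake_scores.values() for s in scores]
    labels = [1] * len(all_fake) + [0] * len(real_scores)
    results["whole"] = auc(labels, all_fake + real_scores)
    return results


# Toy scores, for illustration only.
toy_records = [
    ("real", 0.1), ("real", 0.4),
    ("deepfake_video", 0.9), ("deepfake_video", 0.8),
    ("gan", 0.3), ("gan", 0.2),
    ("dm", 0.6), ("dm", 0.35),
]
toy_results = auc_by_category(toy_records)
```

An AUC of 0.5 (50 on the percentage scale used in the table) corresponds to the coin-toss performance mentioned above: the detector ranks a random fake above a random real image only half of the time.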

B.8 Full Results of Effect of Increasing the Size of Train Set

In this section, we provide the full evaluation results obtained with training sets of different sizes, shown in Table 37 to Table 40. Intersection $F_{EO}$ and AUC align with the results in Fig. 7 of the submitted manuscript.
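The text does not describe how the 20%, 40%, 60%, and 80% training subsets were drawn. As one possible protocol, the sketch below builds nested subsets from a single seeded shuffle, so that each larger subset contains every smaller one and differences between sizes are not confounded by resampling. The function name, the seed, and the nesting itself are assumptions made for illustration.

```python
import random


def nested_train_subsets(sample_ids, percents=(20, 40, 60, 80), seed=0):
    """Return {percent: subset} where each subset is a prefix of one shuffle.

    Assumption: subsets are nested and drawn uniformly at random; the
    sampling actually used for the reported results is not specified.
    """
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding of the subset size.
    return {p: shuffled[: n * p // 100] for p in percents}
```

For example, `nested_train_subsets(range(1000))[20]` holds 200 identifiers, all of which also appear in the 40%, 60%, and 80% subsets.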

Model types: Naive (Xception [61], EfficientB4 [62], ViT-B/16 [63]); Frequency (F3Net [64], SPSL [65], SRM [66]); Spatial (UCF [16], UnivFD [67], CORE [68]); Fairness-enhanced (DAW-FDD [20], DAG-FDD [20], PG-FDD [21]).

| Dataset Size | Attribute | Metric | Xception | EfficientB4 | ViT-B/16 | F3Net | SPSL | SRM | UCF | UnivFD | CORE | DAW-FDD | DAG-FDD | PG-FDD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20% | Gender | $F_{MEO}$ | 1.725 | 0.366 | 0.863 | 1.523 | 0.916 | 1.818 | 0.652 | 0.369 | 2.196 | 1.657 | 1.549 | 1.428 |
| | | $F_{DP}$ | 1.944 | 2.239 | 2.618 | 2.083 | 2.305 | 1.586 | 2.811 | 2.317 | 1.543 | 1.823 | 2.269 | 2.106 |
| | | $F_{OAE}$ | 1.076 | 0.419 | 0.906 | 1.057 | 0.775 | 0.800 | 0.617 | 0.620 | 1.076 | 0.950 | 1.145 | 0.904 |
| | | $F_{EO}$ | 2.030 | 0.386 | 1.635 | 1.945 | 1.280 | 2.081 | 0.768 | 0.629 | 2.214 | 1.738 | 2.244 | 1.742 |
| | Race | $F_{MEO}$ | 14.155 | 11.039 | 10.108 | 13.887 | 12.235 | 15.231 | 11.756 | 14.625 | 16.804 | 16.116 | 12.021 | 12.645 |
| | | $F_{DP}$ | 23.488 | 20.018 | 22.360 | 23.285 | 22.782 | 22.998 | 22.994 | 22.628 | 25.752 | 23.457 | 23.093 | 22.572 |
| | | $F_{OAE}$ | 5.266 | 5.286 | 5.057 | 5.416 | 4.807 | 5.425 | 5.063 | 6.459 | 5.913 | 5.009 | 4.877 | 4.676 |
| | | $F_{EO}$ | 24.015 | 19.947 | 25.662 | 25.293 | 22.940 | 23.207 | 24.837 | 28.765 | 29.623 | 22.162 | 22.625 | 21.318 |
| | Age | $F_{MEO}$ | 6.766 | 5.613 | 5.335 | 7.254 | 5.765 | 6.506 | 8.761 | 5.411 | 7.208 | 5.948 | 5.672 | 5.769 |
| | | $F_{DP}$ | 5.086 | 5.581 | 6.666 | 5.089 | 5.561 | 4.659 | 6.170 | 6.073 | 4.556 | 5.080 | 5.291 | 5.337 |
| | | $F_{OAE}$ | 3.784 | 3.177 | 4.958 | 3.745 | 3.493 | 3.435 | 4.491 | 4.692 | 4.209 | 4.183 | 3.159 | 3.242 |
| | | $F_{EO}$ | 9.533 | 9.157 | 12.476 | 9.632 | 9.222 | 9.203 | 11.928 | 14.228 | 10.548 | 9.699 | 8.470 | 8.339 |
| | Intersection | $F_{MEO}$ | 17.912 | 12.056 | 14.781 | 17.613 | 14.966 | 19.221 | 14.360 | 17.533 | 20.977 | 19.466 | 15.288 | 15.734 |
| | | $F_{DP}$ | 25.299 | 22.237 | 23.053 | 25.005 | 23.895 | 25.807 | 23.863 | 23.563 | 27.720 | 25.542 | 24.273 | 24.374 |
| | | $F_{OAE}$ | 8.001 | 9.313 | 8.647 | 7.898 | 7.506 | 6.137 | 8.856 | 11.806 | 8.713 | 5.859 | 7.538 | 5.378 |
| | | $F_{EO}$ | 54.208 | 45.790 | 54.752 | 56.299 | 50.295 | 52.526 | 55.119 | 66.894 | 63.986 | 44.272 | 49.137 | 45.127 |
| | - | ACC | 95.175 | 94.292 | 93.972 | 94.913 | 95.084 | 95.534 | 95.249 | 90.810 | 94.835 | 94.996 | 95.243 | 95.602 |
| | | AUC | 98.620 | 99.055 | 98.765 | 98.284 | 98.851 | 98.026 | 98.728 | 96.404 | 98.237 | 98.403 | 98.731 | 98.533 |
| | | AP | 98.805 | 99.325 | 99.132 | 98.441 | 99.083 | 98.353 | 98.931 | 97.227 | 98.410 | 98.578 | 98.980 | 98.695 |
| | | EER | 5.563 | 5.208 | 6.267 | 6.142 | 5.292 | 5.489 | 4.933 | 10.001 | 6.169 | 5.696 | 5.424 | 5.148 |

Table 37: Detailed fairness and utility evaluation results on 20% training subset.

Model types: Naive (Xception [61], EfficientB4 [62], ViT-B/16 [63]); Frequency (F3Net [64], SPSL [65], SRM [66]); Spatial (UCF [16], UnivFD [67], CORE [68]); Fairness-enhanced (DAW-FDD [20], DAG-FDD [20], PG-FDD [21]).

| Dataset Size | Attribute | Metric | Xception | EfficientB4 | ViT-B/16 | F3Net | SPSL | SRM | UCF | UnivFD | CORE | DAW-FDD | DAG-FDD | PG-FDD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40% | Gender | $F_{MEO}$ | 1.771 | 2.562 | 1.588 | 1.383 | 1.277 | 0.567 | 1.191 | 0.752 | 1.465 | 1.034 | 1.303 | 1.410 |
| | | $F_{DP}$ | 1.801 | 3.841 | 1.948 | 2.330 | 2.088 | 1.955 | 2.715 | 2.756 | 2.113 | 2.362 | 2.128 | 2.998 |
| | | $F_{OAE}$ | 0.908 | 0.439 | 0.881 | 1.117 | 0.661 | 0.078 | 1.281 | 0.625 | 0.971 | 0.811 | 0.793 | 1.193 |
| | | $F_{EO}$ | 1.799 | 3.236 | 1.809 | 2.034 | 1.415 | 1.095 | 2.286 | 1.023 | 1.796 | 1.474 | 1.510 | 2.263 |
| | Race | $F_{MEO}$ | 14.700 | 12.688 | 9.731 | 14.333 | 10.203 | 7.959 | 14.511 | 14.169 | 14.040 | 11.700 | 12.504 | 7.790 |
| | | $F_{DP}$ | 23.675 | 21.948 | 22.570 | 23.424 | 22.165 | 21.403 | 25.546 | 23.264 | 22.811 | 21.994 | 22.856 | 21.024 |
| | | $F_{OAE}$ | 5.079 | 6.318 | 4.774 | 4.986 | 4.490 | 3.571 | 5.708 | 6.282 | 5.222 | 3.819 | 4.448 | 4.043 |
| | | $F_{EO}$ | 23.443 | 17.727 | 22.852 | 22.917 | 20.787 | 17.734 | 30.682 | 30.210 | 22.288 | 17.342 | 20.633 | 18.703 |
| | Age | $F_{MEO}$ | 7.594 | 3.343 | 4.055 | 6.859 | 5.051 | 4.126 | 6.145 | 5.676 | 6.850 | 5.874 | 6.460 | 3.480 |
| | | $F_{DP}$ | 4.951 | 6.723 | 5.222 | 5.421 | 5.485 | 4.937 | 5.550 | 6.471 | 5.272 | 5.672 | 5.447 | 6.276 |
| | | $F_{OAE}$ | 3.873 | 1.951 | 2.928 | 3.709 | 3.057 | 2.589 | 3.747 | 4.713 | 3.447 | 3.655 | 3.689 | 3.158 |
| | | $F_{EO}$ | 9.596 | 9.580 | 8.258 | 8.736 | 8.461 | 8.457 | 9.256 | 14.995 | 8.374 | 9.236 | 9.119 | 8.222 |
| | Intersection | $F_{MEO}$ | 18.307 | 20.275 | 14.911 | 17.454 | 12.641 | 12.131 | 19.386 | 17.922 | 17.346 | 13.830 | 15.213 | 11.211 |
| | | $F_{DP}$ | 25.685 | 24.437 | 23.109 | 24.801 | 23.725 | 22.091 | 26.683 | 23.444 | 24.706 | 23.145 | 24.355 | 21.662 |
| | | $F_{OAE}$ | 5.814 | 10.624 | 8.960 | 5.964 | 5.707 | 6.691 | 10.527 | 11.615 | 5.902 | 4.630 | 5.040 | 6.579 |
| | | $F_{EO}$ | 48.630 | 41.478 | 49.564 | 47.936 | 44.562 | 37.570 | 63.173 | 66.301 | 49.171 | 34.392 | 43.130 | 40.847 |
| | - | ACC | 95.796 | 94.030 | 94.822 | 95.844 | 95.794 | 95.393 | 94.754 | 90.711 | 95.984 | 95.975 | 96.257 | 95.337 |
| | | AUC | 98.696 | 98.932 | 99.024 | 98.851 | 99.064 | 98.722 | 98.306 | 96.371 | 98.824 | 98.974 | 98.949 | 99.092 |
| | | AP | 98.778 | 99.269 | 99.310 | 98.959 | 99.236 | 98.968 | 98.588 | 97.224 | 98.984 | 99.139 | 99.035 | 99.318 |
| | | EER | 5.027 | 5.442 | 5.474 | 5.002 | 4.574 | 4.770 | 6.037 | 10.044 | 5.009 | 4.567 | 4.285 | 4.729 |

Table 38: Detailed fairness and utility evaluation results on 40% training subset.

Model types: Naive (Xception [61], EfficientB4 [62], ViT-B/16 [63]); Frequency (F3Net [64], SPSL [65], SRM [66]); Spatial (UCF [16], UnivFD [67], CORE [68]); Fairness-enhanced (DAW-FDD [20], DAG-FDD [20], PG-FDD [21]).

| Dataset Size | Attribute | Metric | Xception | EfficientB4 | ViT-B/16 | F3Net | SPSL | SRM | UCF | UnivFD | CORE | DAW-FDD | DAG-FDD | PG-FDD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60% | Gender | $F_{MEO}$ | 0.801 | 0.489 | 2.239 | 1.576 | 1.787 | 1.179 | 0.745 | 0.539 | 0.907 | 1.512 | 1.527 | 0.408 |
| | | $F_{DP}$ | 2.899 | 2.747 | 4.223 | 1.996 | 1.638 | 3.062 | 2.596 | 2.509 | 2.291 | 3.010 | 1.880 | 2.857 |
| | | $F_{OAE}$ | 0.692 | 0.109 | 0.935 | 0.904 | 0.802 | 0.361 | 0.783 | 0.634 | 0.697 | 1.296 | 0.716 | 0.270 |
| | | $F_{EO}$ | 1.136 | 0.677 | 3.474 | 1.757 | 2.002 | 1.220 | 1.328 | 0.586 | 1.153 | 2.435 | 1.594 | 0.547 |
| | Race | $F_{MEO}$ | 8.652 | 16.885 | 6.328 | 13.433 | 16.190 | 6.243 | 9.960 | 14.482 | 14.243 | 9.849 | 14.223 | 5.960 |
| | | $F_{DP}$ | 21.794 | 26.205 | 14.609 | 23.519 | 21.498 | 18.874 | 20.947 | 23.469 | 24.031 | 20.453 | 23.547 | 15.860 |
| | | $F_{OAE}$ | 3.781 | 6.630 | 5.650 | 4.671 | 5.716 | 4.346 | 4.133 | 6.328 | 4.569 | 3.770 | 5.247 | 3.746 |
| | | $F_{EO}$ | 18.942 | 23.670 | 21.707 | 22.128 | 23.478 | 12.107 | 15.885 | 29.990 | 22.562 | 14.960 | 22.213 | 14.735 |
| | Age | $F_{MEO}$ | 5.047 | 5.153 | 5.719 | 6.155 | 4.154 | 3.810 | 4.710 | 5.243 | 5.789 | 5.512 | 3.699 | 3.553 |
| | | $F_{DP}$ | 6.012 | 4.411 | 7.664 | 5.157 | 5.456 | 6.020 | 6.042 | 6.245 | 5.444 | 7.926 | 5.023 | 6.488 |
| | | $F_{OAE}$ | 2.916 | 2.496 | 4.283 | 3.316 | 3.897 | 2.374 | 2.752 | 4.555 | 3.422 | 3.886 | 2.858 | 2.244 |
| | | $F_{EO}$ | 7.607 | 10.635 | 13.662 | 8.084 | 9.090 | 7.503 | 6.872 | 14.282 | 8.089 | 8.321 | 6.951 | 8.124 |
| | Intersection | $F_{MEO}$ | 10.466 | 25.134 | 7.982 | 16.425 | 17.532 | 10.693 | 12.440 | 17.613 | 16.374 | 12.417 | 17.272 | 9.574 |
| | | $F_{DP}$ | 22.891 | 27.880 | 18.106 | 25.118 | 24.338 | 20.236 | 22.678 | 23.819 | 25.063 | 22.277 | 25.459 | 18.176 |
| | | $F_{OAE}$ | 5.884 | 11.229 | 7.443 | 5.547 | 6.822 | 7.714 | 4.749 | 11.612 | 5.287 | 5.726 | 5.873 | 5.899 |
| | | $F_{EO}$ | 39.873 | 51.509 | 46.548 | 46.511 | 52.673 | 28.055 | 35.888 | 66.884 | 45.261 | 30.682 | 47.626 | 31.167 |
| | - | ACC | 96.505 | 93.931 | 93.612 | 96.221 | 95.676 | 96.510 | 96.567 | 90.882 | 96.009 | 95.025 | 96.332 | 96.488 |
| | | AUC | 98.970 | 98.828 | 98.536 | 99.075 | 99.102 | 99.236 | 99.026 | 96.461 | 99.189 | 99.003 | 99.354 | 99.401 |
| | | AP | 98.987 | 99.195 | 98.953 | 99.170 | 99.234 | 99.415 | 99.012 | 97.279 | 99.351 | 99.285 | 99.503 | 99.461 |
| | | EER | 3.829 | 6.004 | 6.668 | 4.322 | 4.314 | 3.248 | 3.592 | 9.875 | 4.351 | 5.072 | 3.882 | 3.583 |

Table 39: Detailed fairness and utility evaluation results on 60% training subset.

Model types: Naive (Xception [61], EfficientB4 [62], ViT-B/16 [63]); Frequency (F3Net [64], SPSL [65], SRM [66]); Spatial (UCF [16], UnivFD [67], CORE [68]); Fairness-enhanced (DAW-FDD [20], DAG-FDD [20], PG-FDD [21]).

| Dataset Size | Attribute | Metric | Xception | EfficientB4 | ViT-B/16 | F3Net | SPSL | SRM | UCF | UnivFD | CORE | DAW-FDD | DAG-FDD | PG-FDD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 80% | Gender | $F_{MEO}$ | 1.753 | 0.256 | 1.697 | 0.976 | 1.199 | 0.166 | 1.235 | 0.447 | 1.428 | 0.398 | 0.526 | 0.339 |
| | | $F_{DP}$ | 1.925 | 2.648 | 1.891 | 2.861 | 3.002 | 2.642 | 3.511 | 2.461 | 1.881 | 2.762 | 2.695 | 2.643 |
| | | $F_{OAE}$ | 0.943 | 0.002 | 0.988 | 0.280 | 0.474 | 0.316 | 0.737 | 0.596 | 0.665 | 0.214 | 0.408 | 0.172 |
| | | $F_{EO}$ | 1.893 | 0.364 | 1.910 | 1.214 | 1.489 | 0.218 | 1.237 | 0.467 | 1.495 | 0.522 | 0.788 | 0.384 |
| | Race | $F_{MEO}$ | 11.908 | 11.806 | 9.589 | 4.724 | 3.751 | 8.864 | 2.988 | 14.911 | 13.396 | 3.892 | 5.036 | 2.891 |
| | | $F_{DP}$ | 22.332 | 21.476 | 19.620 | 18.520 | 17.354 | 18.431 | 16.783 | 23.411 | 22.809 | 16.666 | 18.631 | 17.598 |
| | | $F_{OAE}$ | 4.298 | 6.127 | 4.724 | 3.890 | 3.573 | 4.030 | 2.667 | 6.322 | 4.970 | 3.282 | 4.350 | 2.731 |
| | | $F_{EO}$ | 18.793 | 19.118 | 20.637 | 10.997 | 9.458 | 14.889 | 10.966 | 29.610 | 22.226 | 9.621 | 13.148 | 8.090 |
| | Age | $F_{MEO}$ | 5.554 | 4.823 | 4.219 | 2.168 | 3.355 | 2.588 | 1.699 | 5.731 | 6.528 | 2.822 | 1.498 | 1.076 |
| | | $F_{DP}$ | 5.307 | 5.675 | 5.586 | 6.111 | 6.492 | 6.159 | 6.884 | 6.433 | 4.840 | 6.781 | 5.943 | 5.842 |
| | | $F_{OAE}$ | 3.001 | 2.397 | 3.365 | 1.221 | 1.905 | 1.389 | 0.832 | 4.916 | 3.150 | 2.718 | 1.133 | 0.744 |
| | | $F_{EO}$ | 7.274 | 8.476 | 8.710 | 4.026 | 8.252 | 5.533 | 5.139 | 15.746 | 8.380 | 7.114 | 4.174 | 2.835 |
| | Intersection | $F_{MEO}$ | 14.979 | 17.336 | 11.294 | 6.650 | 6.863 | 9.372 | 5.369 | 18.159 | 16.769 | 5.729 | 8.210 | 5.443 |
| | | $F_{DP}$ | 24.220 | 21.943 | 21.145 | 21.258 | 20.254 | 20.920 | 19.077 | 24.033 | 24.954 | 18.015 | 20.556 | 19.798 |
| | | $F_{OAE}$ | 5.025 | 10.608 | 7.908 | 6.697 | 5.709 | 6.343 | 5.118 | 11.541 | 5.558 | 5.760 | 7.955 | 4.583 |
| | | $F_{EO}$ | 40.744 | 44.028 | 45.684 | 27.249 | 22.360 | 32.012 | 24.504 | 66.750 | 46.492 | 21.906 | 29.401 | 17.687 |
| | - | ACC | 96.629 | 94.917 | 94.904 | 95.309 | 96.461 | 96.548 | 97.736 | 90.898 | 95.586 | 95.808 | 97.317 | 98.277 |
| | | AUC | 99.361 | 98.788 | 99.143 | 99.409 | 99.597 | 99.682 | 99.753 | 96.501 | 98.440 | 99.419 | 99.739 | 99.860 |
| | | AP | 99.429 | 99.051 | 99.403 | 99.523 | 99.653 | 99.765 | 99.801 | 97.308 | 98.562 | 99.589 | 99.817 | 99.874 |
| | | EER | 3.538 | 5.198 | 5.189 | 3.894 | 3.138 | 2.707 | 2.276 | 9.817 | 5.470 | 4.259 | 2.745 | 1.738 |

Table 40: Detailed fairness and utility evaluation results on 80% training subset.

B.9 Fairness and Utility Trade-off

Fig. B.7 presents the trade-offs between $F_{DP}$ on age and AUC for three fairness-enhanced methods, to analyze how well these methods balance utility and fairness. 1) PG-FDD [21] achieves the best utility-fairness trade-off overall. It improves fairness without compromising utility, maintaining high detection accuracy. For instance, PG-FDD achieves a higher AUC than DAW-FDD and DAG-FDD while maintaining comparable fairness metrics. 2) DAW-FDD [20] is sensitive to the hyperparameter that balances utility and fairness. For example, when its fairness metric approaches zero, its utility also drops to coin-toss performance. This sensitivity can hinder practical deployment, as extensive tuning is required to optimize performance. 3) To ensure broader applicability and reliability, future fairness approaches should aim to minimize sensitivity to hyperparameter settings.
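One way to read such a trade-off plot is to keep only the operating points that no other point beats on both axes (higher AUC and lower $F_{DP}$). The sketch below computes this Pareto front; the function name, the optional `min_auc` floor, and the example points are hypothetical and are not measurements from the paper.

```python
def pareto_front(points, min_auc=None):
    """Return the non-dominated (name, auc, fairness_gap) points.

    A point is dominated if another point has AUC at least as high and a
    fairness gap at least as low, with a strict improvement in one of them.
    min_auc optionally discards points below a utility floor first.
    """
    candidates = [p for p in points if min_auc is None or p[1] >= min_auc]
    front = []
    for name, auc, gap in candidates:
        dominated = any(
            other_auc >= auc and other_gap <= gap
            and (other_auc > auc or other_gap < gap)
            for _, other_auc, other_gap in candidates
        )
        if not dominated:
            front.append((name, auc, gap))
    return front


# Hypothetical operating points (name, AUC in %, F_DP in %), illustration only.
example_points = [
    ("setting_a", 99.0, 5.0),
    ("setting_b", 98.0, 6.0),   # worse than setting_a on both axes
    ("setting_c", 97.0, 3.0),
    ("setting_d", 50.0, 0.1),   # near-zero gap, but coin-toss utility
]
front_all = pareto_front(example_points)
front_useful = pareto_front(example_points, min_auc=90.0)
```

Note that `setting_d` remains on the unconstrained front: a detector with a near-zero fairness gap and coin-toss utility is formally non-dominated, which is why a utility floor such as `min_auc` is useful when comparing hyperparameter settings.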

Appendix C Datasheet for AI-Face

In this section, we present a DataSheet [87] for AI-Face.

C.1 Motivation For Dataset Creation

  • Why is the dataset created? For researchers to evaluate the fairness of AI face detection models or to train fairer models. Please see Section 2 ‘Background and Motivation’ in the submitted manuscript.

  • Has the dataset been used already? Yes. Our fairness benchmark is based on this dataset.

  • What (other) tasks could the dataset be used for? It could be used as training data for the generative-method attribution task.

  • Who funded dataset creation? This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2348419 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. Please see the Acknowledgment.

C.2 Data Composition

  • What are the instances? The instances that we consider in this work are real face images and AI-generated face images from public datasets.

  • How many instances are there? We include more than 2 million face images from public datasets. Please see Table 13 for details.

  • What data does each instance consist of? Each instance consists of an image.

  • Is there a label or target associated with each instance? Each image is associated with a gender annotation, an age annotation, and a race annotation, an uncertainty score for each of these three predictions, and a target label (fake or real).

  • Is any information missing from individual instances? No.

  • Are relationships between individual instances made explicit? Not applicable – we do not study the relationship between each image.

  • Does the dataset contain all possible instances or is it a sample? Contains all instances our curation pipeline collected. Since the current dataset does not cover all available images online, there is a high probability more instances can be collected in the future.

  • Are there recommended data splits (e.g., training, development/validation, testing)? For detector development and training, the dataset can be split as 6:2:2.

  • Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. Yes. Despite our extensive efforts to reduce demographic label noise, including human corrections based on uncertainty scores, there may still be mislabeled instances. Given the dataset’s size of over 2 million images, it is impractical for humans to manually check and correct each image individually.

  • Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? The dataset is self-contained.
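The recommended 6:2:2 split can be sketched as follows. This is a minimal example assuming a uniformly random, seeded split over image identifiers; the function name and seed are hypothetical, and the released dataset may ship its own predefined split files, which should take precedence.

```python
import random


def split_6_2_2(items, seed=0):
    """Shuffle items reproducibly and split them into train/val/test at 6:2:2."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10  # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

For fairness evaluation it may be preferable to stratify the split by demographic group and target label so that small groups are represented in every partition; that refinement is omitted here for brevity.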

C.3 Collection Process

  • What mechanisms or procedures were used to collect the data? We build our AI-Face dataset by collecting and integrating public AI-generated face images sourced from academic publications, GitHub repositories, and commercial tools. Please see ‘Data Collection’ in Section 3.2.

  • How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data? The data can be acquired after we verify the user-submitted, signed EULA.

  • If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Not applicable. We did not sample data from a larger set. But we use RetinaFace [60] for detecting and cropping faces to ensure each image only contains one face.

  • Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The data was collected from February 2024 to April 2024, although it was originally released before this time. Please refer to the cited papers in Table 13 for the original release dates.

C.4 Data Processing

  • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? Yes. This is discussed in ‘Demographically Annotation Generation’ in Section 3.2.

  • Was the ‘raw’ data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the ‘raw’ data. The ‘raw’ data can be acquired through the original data publisher. Please see the cited papers in Table 13.

  • Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point. Yes. We use RetinaFace [60] for detecting and cropping faces to ensure each image only contains one face. Demographic annotations are given by our annotator; see ‘Annotator Development’ in Section 3.1. Our annotator code is available on our GitHub repository.

  • Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet? If not, what are the limitations? Yes. The dataset does allow for the study of our goal, as it covers comprehensive generation methods, demographic annotations for evaluating current detectors and training fairer detectors.

C.5 Dataset Distribution

  • How will the dataset be distributed? We distribute all the data, together with CSV files containing all image annotations, under the CC BY-NC-ND 4.0 license and strictly for research purposes.

  • When will the dataset be released/first distributed? What license (if any) is it distributed under? The data has been released under the CC BY-NC-ND 4.0 license for research-based use only. Users can access our dataset by submitting an EULA. The dataset license and EULA are available on our GitHub https://github.com/Purdue-M2/AI-Face-FairnessBench.

  • Are there any copyrights on the data? We believe our use is ‘fair use’ since all data in our dataset is collected from public datasets.

  • Are there any fees or access restrictions? No.
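Since annotations are distributed as CSV files with demographic labels and per-attribute uncertainty scores, a consumer might load them and flag high-uncertainty rows for manual checking. The sketch below is illustrative only: the column names, the placeholder label values, and the 0.5 threshold are hypothetical, and the actual CSV schema should be taken from the released files.

```python
import csv
import io

# Hypothetical column names; the released CSV files may use different ones.
UNCERTAINTY_FIELDS = ("gender_uncertainty", "age_uncertainty", "race_uncertainty")

# Made-up sample rows with placeholder labels, for illustration only.
SAMPLE_CSV = """image_path,gender,age,race,gender_uncertainty,age_uncertainty,race_uncertainty,target
img_0001.png,gender_a,age_a,race_a,0.05,0.10,0.08,fake
img_0002.png,gender_b,age_c,race_b,0.02,0.71,0.12,real
img_0003.png,gender_b,age_b,race_c,0.40,0.30,0.55,fake
"""


def load_annotations(csv_text):
    """Parse annotation CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def flag_for_review(rows, threshold=0.5):
    """Return image paths whose largest uncertainty score exceeds the threshold."""
    flagged = []
    for row in rows:
        worst = max(float(row[field]) for field in UNCERTAINTY_FIELDS)
        if worst > threshold:
            flagged.append(row["image_path"])
    return flagged


rows = load_annotations(SAMPLE_CSV)
to_review = flag_for_review(rows, threshold=0.5)
```

With the sample rows above, the second and third images are flagged because one of their uncertainty scores exceeds 0.5; in practice the threshold would be chosen from the score distribution of the real annotations.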

C.6 Dataset Maintenance

  • Who is supporting/hosting/maintaining the dataset? The first author of this paper.

  • Will the dataset be updated? If so, how often and by whom? We do not plan to update it at this time.

  • Is there a repository to link to any/all papers/systems that use this dataset? Not right now, but we encourage anyone who uses the dataset to cite our paper so it can be easily found. Our fairness benchmark uses this dataset, and the benchmark code is available on our GitHub https://github.com/Purdue-M2/AI-Face-FairnessBench.

  • If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Not at this time.

C.7 Legal and Ethical Considerations

  • Were any ethical review processes conducted (e.g., by an institutional review board)? No official processes were done since all data in our dataset were collected from the existing public datasets.

  • Does the dataset contain data that might be considered confidential? No. We only use data from public datasets.

  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. No. It is a face image dataset, and we have not seen any instance of offensive or abusive content.

  • Does the dataset relate to people? Yes. It is a face image dataset containing real face images and AI-generated face images.

  • Does the dataset identify any subpopulations (e.g., by age, gender)? Yes, through demographic annotations.

  • Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? Yes. It is a face image dataset. The age, gender, and race can be identified through the face image, also through the demographic annotation we provide. All of the images that we use are from publicly available data.

C.8 Author Statement and Confirmation of Data License

The authors of this work declare that the dataset described and provided has been collected, processed, and made available with full adherence to all applicable ethical guidelines and regulations. We accept full responsibility for any violations of rights or ethical guidelines that may arise from the use of this dataset. We also confirm that the dataset is released under the CC BY-NC-ND 4.0 license, permitting sharing and downloading of the work in any medium, provided the original author is credited, and it is used non-commercially with no derivative works created.