
AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark

Li Lin¹, Santosh¹, Xin Wang², Shu Hu¹
¹Purdue University ²University at Albany, SUNY
Corresponding author: Shu Hu (hu968@purdue.edu)

Abstract

AI-generated faces have enriched human life in areas such as entertainment, education, and art. However, they also pose misuse risks. Detecting AI-generated faces has therefore become crucial, yet current detectors show biased performance across different demographic groups. Biases can be mitigated by designing algorithmic fairness methods, which usually require demographically annotated face datasets for model training. However, no existing dataset comprehensively encompasses both demographic attributes and diverse generative methods, which hinders the development of fair detectors for AI-generated faces. In this work, we introduce the AI-Face dataset, the first million-scale demographically annotated AI-generated face image dataset, including real faces, faces from deepfake videos, and faces generated by Generative Adversarial Networks and Diffusion Models. Based on this dataset, we conduct the first comprehensive fairness benchmark to assess various AI face detectors and provide insights and findings to promote the future design of fair AI face detectors. Our AI-Face dataset and benchmark code are publicly available at https://github.com/Purdue-M2/AI-Face-FairnessBench.

Figure 1: Overview of the AI-Face dataset. Each face has three demographic annotations with uncertainty scores.

1 Introduction

AI-generated faces, created using sophisticated AI technologies, are visually difficult to discern from real ones [1]. They can be summarized into three categories: deepfake videos [2], typically created using Variational Autoencoders (VAEs) [3, 4]; faces generated by Generative Adversarial Networks (GANs) [5, 6, 7, 8]; and faces generated by Diffusion Models (DMs) [9]. These technologies have significantly advanced the realism and controllability of synthetic facial representations. Generated faces can enrich media and increase creativity [10]. However, they also carry significant risks of misuse. For example, during the 2024 United States presidential election, fake face images of Donald Trump surrounded by groups of Black people smiling and laughing, intended to encourage African Americans to vote Republican, spread online [11]. Such images could distort public opinion and erode people’s trust in media [12, 13], necessitating the detection of AI-generated faces for their ethical use.

However, one major issue in current AI face detectors [14, 15, 16, 17] is biased detection (i.e., unfair detection performance among demographic groups [18, 19, 20, 21]). Biases can be mitigated by designing algorithmic fairness methods, but these usually require demographically annotated face datasets for model training. For example, works like [20, 21] have made efforts to enhance detection fairness based on A-FF++ [19] and A-DFD [19]. However, both datasets contain only faces from deepfake videos, so models trained on them may not be applicable to fairly detecting faces generated by GANs and DMs. Although a few datasets (e.g., GenData [22]) cover GAN and DM faces, their demographic annotations are not comprehensive. Most importantly, no existing dataset is diverse enough in generation methods to develop AI face detectors that can cope with rapidly evolving generative models. These limitations of existing datasets hamper the development of fair technologies for detecting AI-generated faces.

Moreover, benchmarking fairness provides a direct method to uncover prevalent and unique fairness issues in recent AI-generated face detection. However, there is a lack of a comprehensive benchmark to estimate the fairness of existing AI face detectors. Existing benchmarks [23, 24, 25, 26] primarily assess utility, neglecting systematic fairness evaluation. One study [18] does evaluate fairness in detection models, but their examination is only based on deepfake video datasets using a few outdated detectors. Detectors’ fairness performance on GAN faces and DM faces has not been extensively explored. The absence of a comprehensive fairness benchmark impedes a thorough understanding of the fairness behaviors of recent AI face detectors and obscures the research path for detector fairness guarantees.

In this work, we build the first million-scale demographically annotated AI-generated face image dataset: AI-Face (see Fig. 1). The face images are collected from various public datasets, including the real faces that are usually used to train AI face generators, faces from deepfake videos, and faces generated by GANs and DMs. Each face is demographically annotated with an uncertainty score on each predicted demographic attribute by our designed Contrastive Language-Image Pretraining (CLIP) [27]-based lightweight annotator. To improve the quality of annotations, we recruit three humans to correct annotations with high uncertainty scores manually. Next, we conduct the first comprehensive fairness benchmark on our dataset to estimate the fairness performance of 12 representative detectors coming from four model types. Our benchmark exposes common and unique fairness challenges in recent AI face detectors, providing essential insights that can guide and enhance the future design of fair AI face detectors. Our contributions are as follows:

• We build the first comprehensive million-scale demographically annotated AI-generated face dataset by leveraging our developed lightweight annotator with human correction.

• We conduct the first comprehensive fairness benchmark of AI-generated face detectors, providing an extensive fairness assessment of current representative detectors.

• Based on our experiments and observations, we summarize the unsolved questions and offer valuable insights within this research domain, setting the stage for future investigations.

2 Background and Motivation

AI-generated Faces and Biased Detection. AI-generated face images, created by advanced AI technologies, are visually difficult to discern from real ones, see Fig. 1. They can be summarized into three categories: 1) Deepfake Videos. Initiated in 2017 [13], these use face-swapping techniques with a variational autoencoder to replace a face in a target video with one from a source [3, 4]. Note that our paper focuses solely on images extracted from videos. 2) GAN-generated Faces. Post-2017, Generative Adversarial Networks (GANs) [28] like StyleGANs [6, 7, 8] have significantly improved generated face realism. 3) DM-generated Faces. Diffusion models (DMs), emerging in 2021, generate detailed faces from textual descriptions and offer greater controllability. Tools like Midjourney [29] and DALLE2 [30] facilitate customized face generation. While these AI-generated faces can enhance visual media and creativity [10], they also pose risks, such as being misused in social media profiles [31, 32]. Therefore, numerous studies focus on detecting AI-generated faces [14, 15, 16, 17], but current detectors often show performance disparities among demographic groups like race and gender [18, 19, 20, 21]. This bias can lead to unfair targeting or exclusion, undermining trust in detection models. Recent efforts [20, 21] aim to enhance fairness in deepfake detection but mainly address deepfake videos, overlooking biases in detecting GAN and DM-generated faces.

Dataset	Face Images (#Real, #Fake)	Generation Category (Deepfake Videos, GAN)	#Generation Methods	Source of Real Images	Demographic Annotation (Gender, Race, Age)
A-FF++ [19]	29.8K, 149.1K	✓		YouTube	✓
A-DFD [19]	10.8K, 89.6K	✓		Self-Recording	✓
A-DFDC [19]	54.5K, 52.6K	✓		Self-Recording	✓
A-Celeb-DF-v2 [19]	26.3K, 166.5K	✓		Self-Recording	✓
A-DF-1.0 [19]	870.3K, 321.5K	✓		Self-Recording	✓
DF-1.0 [33]	2.9M, 14.7M	✓		Self-Recording	✓
DeePhy [34]	50.4K	✓		YouTube	✓
DF-Platter [35]	392.3K, 653.4K	✓		YouTube	✓
GenData [22]	20K	✓		CelebA [36]	✓
Ours	866K, 1.2M	✓		FFHQ [6], CASIA-WebFace [37], CelebA [36], IMDB-WIKI [38], real from FF++ [2], DFDC [39], DFD [40], Celeb-DF-v2 [41]	✓

Table 1: Quantitative comparison of existing AI-generated face datasets and ours.

The Related Existing Datasets. Current AI-generated facial datasets with demographic annotations are limited in size, generation categories, methods, and annotations, as illustrated in Table 1. For instance, A-FF++, A-DFD, A-DFDC, and A-Celeb-DF-v2 [19] are deepfake video datasets with fewer than one million images. Datasets like DF-1.0 [33] and DF-Platter [35] lack comprehensive demographic annotations. Additionally, existing datasets offer limited generation methods. These limitations hinder the development of fairer AI face detectors, motivating us to build a million-scale demographically annotated AI-Face dataset.

Existing Benchmarks	Category			Scope of Benchmark
Existing Benchmarks	Deepfake Videos	GAN	DM	Utility	Fairness
DeepfakeBench [25]	✓	✓		✓
Lin et al. [24]	✓	✓		✓
Le et al. [26]	✓	✓		✓
CDDB [23]		✓		✓
Loc et al. [18]	✓			✓	✓
Ours	✓	✓	✓	✓	✓

Table 2: Comparison with existing AI-generated face detection benchmarks.

Benchmark for Detecting AI-generated Faces. Benchmarks are essential for evaluating AI-generated face detectors under standardized conditions. Existing benchmarks, as shown in Table 2, mainly focus on detectors’ utility, often overlooking fairness [23, 24, 25, 26]. Only Loc et al. [18] examined detector fairness. However, their study focused only on deepfake video datasets, not on GAN- and DM-generated faces. This motivates us to conduct a comprehensive benchmark to evaluate AI face detectors’ fairness.

3 The Demographically Annotated AI-Face Dataset

To avoid the prohibitive time cost of manual annotation, we build our dataset in two phases: Annotator Development and Demographic Annotation Generation, as shown in Fig. 2.

3.1 Phase 1: Annotator Development

Problem Definition. Existing online software (e.g., Face++ [42]) and open-source tools (e.g., InsightFace [43]) can predict face attributes. However, they fall short for our task for two reasons: 1) They are mostly designed for face recognition and trained on datasets of real face images, and they lack the generalization capability needed to annotate AI-generated face images. 2) They do not provide uncertainty scores for their predictions, which are needed to identify mispredicted samples for further annotation correction. We are given a training dataset $\mathbb{D}=\{(X_{i},G_{i},A_{i},R_{i})\}_{i=1}^{n}$ of size $n$, where $X_{i}$, $G_{i}\in\{Female,Male\}$, $A_{i}\in\{Young,Middle\text{-}aged,Senior,Others\}$, and $R_{i}\in\{Asian,White,Black,Others\}$ represent the $i$-th face image and its gender, age, and race labels, respectively. Our goal is to design a lightweight, generalizable annotator based on $\mathbb{D}$ that predicts facial demographic attributes with uncertainty scores for each face image in our dataset.

Annotator. Architecture: We utilize CLIP [27] for its strong zero-shot and few-shot learning capabilities. Leveraging CLIP’s pre-training on diverse datasets, we create a lightweight annotator for facial images. Our annotator employs a frozen pre-trained CLIP ViT L/14 [44] as a feature extractor $\mathbf{E}$, followed by a trainable 3-layer Multilayer Perceptron (MLP) as a multi-task (i.e., gender, age, and race prediction) classifier parameterized by $\theta$. Loss: For each image $X_{i}$, its feature $f_{i}=\mathbf{E}(X_{i})$ is fed into the MLP multi-task classifier, which is trained with conventional classification losses for face attribute prediction. The learning objective is formulated as $\mathcal{L}(\theta)=C(\widetilde{h}(f_{i}),G_{i})+C(\overline{h}(f_{i}),A_{i})+C(\widehat{h}(f_{i}),R_{i})$, where $C(\cdot,\cdot)$ represents the (binary) cross-entropy (CE) loss, and $\widetilde{h}$, $\overline{h}$, and $\widehat{h}$ represent the classification heads for gender, age, and race, respectively. Optimization: Traditional optimization methods like stochastic gradient descent can lead to poor model generalization due to sharp loss landscapes with multiple local and global minima. To address this, we use Sharpness-Aware Minimization (SAM) [45] to enhance our annotator’s generalization by flattening the loss landscape. Specifically, flattening is attained by determining the optimal perturbation $\epsilon^{*}$ of the model parameters $\theta$ that maximizes the loss, formulated as $\epsilon^{*}=\arg\max_{\|\epsilon\|_{2}\leq\gamma}\mathcal{L}(\theta+\epsilon)\approx\arg\max_{\|\epsilon\|_{2}\leq\gamma}\epsilon^{\top}\nabla_{\theta}\mathcal{L}=\gamma\,\texttt{sign}(\nabla_{\theta}\mathcal{L})$, where $\gamma$ controls the perturbation magnitude. The approximation uses a first-order Taylor expansion, assuming $\epsilon$ is small. The final equality is obtained by solving a dual norm problem, where sign is the sign function and $\nabla_{\theta}\mathcal{L}$ is the gradient of $\mathcal{L}$ with respect to $\theta$. The model parameters are then updated by solving $\min_{\theta}\mathcal{L}(\theta+\epsilon^{*})$.
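The two-step SAM update can be illustrated with a minimal, framework-free sketch. It is not the authors' training code: the toy quadratic loss, the helper names `sam_step` and `toy_grad`, and the values of `gamma` and `lr` are assumptions chosen for illustration. By default the sketch uses the L2-normalized ascent direction of the original SAM formulation [45]; setting `use_sign=True` instead applies the sign-based closed form written in the text.

```python
import math

def sam_step(theta, grad_fn, gamma=0.05, lr=0.1, use_sign=False):
    """One SAM update on a list of floats; grad_fn(theta) returns dL/dtheta."""
    g = grad_fn(theta)
    if use_sign:
        # Closed form as written in the text: eps* = gamma * sign(grad).
        eps = [gamma * ((x > 0) - (x < 0)) for x in g]
    else:
        # L2-normalized ascent direction from the SAM paper [45].
        norm = math.sqrt(sum(x * x for x in g)) or 1.0
        eps = [gamma * x / norm for x in g]
    # Gradient evaluated at the perturbed point theta + eps*.
    g_adv = grad_fn([t + e for t, e in zip(theta, eps)])
    # Descend from the original theta using the perturbed-point gradient.
    return [t - lr * ga for t, ga in zip(theta, g_adv)]

# Hypothetical toy loss L(theta) = sum(theta_i^2), whose gradient is 2 * theta.
def toy_grad(theta):
    return [2.0 * t for t in theta]

theta_next = sam_step([3.0, 4.0], toy_grad, gamma=0.1, lr=0.1)
theta_next_sign = sam_step([3.0, 4.0], toy_grad, gamma=0.1, lr=0.1, use_sign=True)
```

In a real annotator, `grad_fn` would be the gradient of the multi-task loss with respect to the MLP parameters, computed by an automatic differentiation library.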

Uncertainty Estimation. Although our annotator achieves high prediction performance, labels may still be mispredicted due to the ambiguity of the face images (see an example in Fig. 3). Therefore, it is crucial to provide an uncertainty score for each prediction from the annotator. To this end, inspired by [46], we incorporate dropout at each layer of the MLP for uncertainty estimation at test time. This involves performing $k$ stochastic forward passes for a given test image $X$, each with a unique dropout pattern, which yields $k$ distinct softmax outputs for each demographic attribute $a$, denoted as $\{x^{1,(a)},...,x^{k,(a)}\}$. The uncertainty score for $a$ on image $X$ (denoted as $V(X^{(a)})$) is then calculated as $V(X^{(a)})=1-\Big\{\frac{1-\rho}{k}\sum_{i=1}^{k}x^{i,(a)}-\frac{\rho}{k^{2}}\sum_{i=1}^{k}\sum_{j=1}^{k}|x^{i,(a)}-x^{j,(a)}|\Big\}$, where $\rho\in[0,1]$ is a user-defined parameter that balances the measure of centrality (i.e., the first term in $\{\}$, which indicates the likelihood of the prediction being correct) against the measure of dispersion (i.e., the second term in $\{\}$, which reflects the consensus among the stochastic outputs).
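The uncertainty score can be computed directly from the $k$ stochastic outputs. The sketch below is a minimal illustration, assuming each $x^{i,(a)}$ is a scalar softmax probability of the predicted class; the function name `uncertainty_score` and the example values are hypothetical.

```python
def uncertainty_score(probs, rho=0.5):
    """V = 1 - {centrality - dispersion} over k stochastic softmax outputs."""
    k = len(probs)
    # Centrality: (1 - rho) / k * sum_i x_i (confidence of the prediction).
    centrality = (1.0 - rho) / k * sum(probs)
    # Dispersion: rho / k^2 * sum_i sum_j |x_i - x_j| (disagreement across passes).
    dispersion = rho / k ** 2 * sum(abs(a - b) for a in probs for b in probs)
    return 1.0 - (centrality - dispersion)

# Same mean confidence (0.7), but the second set of passes disagrees more.
v_consistent = uncertainty_score([0.7, 0.7], rho=0.5)
v_dispersed = uncertainty_score([0.9, 0.5], rho=0.5)
```

With equal mean confidence, the passes that disagree receive the higher uncertainty score, which is the behavior the dispersion term is meant to capture.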

Evaluation. To demonstrate our annotator’s effectiveness, we answer the following questions: Q1: How do the general performance and generalization capability of our annotator compare with the baselines? Q2: How does sample difficulty affect the annotator’s performance? To leverage the good generalization capabilities of CLIP, our annotator is trained on the VGGFace2 [47] dataset, which contains 9.1K individuals with 3.3M images. More importantly, [48] provides comprehensive demographic annotations for this dataset. We compare our annotator with the current state-of-the-art face attribute prediction tools Face++ [42] and InsightFace [43]. Since they do not offer predictions for the race attribute, our evaluation is confined to gender and age. The mean and standard deviation are reported based on 5 random runs. More details are in Appendix A.1.1.

For Q1, Setting: We perform intra-domain (train on VGGFace2, test on its official test set) and cross-domain (train on VGGFace2, test on four AI-generated face datasets) evaluations. Specifically, A-FF++, A-DFDC, A-DFD, and A-Celeb-DF-v2 are selected from [19] for cross-domain evaluation. Since A-DFD and A-Celeb-DF-v2 have limited age and race annotations, our evaluation of these two is confined to gender. These datasets are chosen because they closely match our objective and are not used to train Face++ and InsightFace. Results: The ‘All’ results in Table 3 demonstrate our annotator’s superiority in general performance and generalization capability against Face++ and InsightFace. Under intra-domain evaluation, it surpasses the second-best method, Face++, by 5.8% on gender and 18.9% on age. In cross-domain evaluation, our annotator maintains high accuracy on all datasets, reflecting good generalization. Remarkably, on the A-FF++ dataset, our annotator outperforms Face++ by a substantial margin of up to 11.4% and InsightFace by 16.1% on age.

Level			All			Easy			Medium			Hard		
Type	Dataset	Attribute	InsightFace [43]	Face++ [42]	Ours	InsightFace [43]	Face++ [42]	Ours	InsightFace [43]	Face++ [42]	Ours	InsightFace [43]	Face++ [42]	Ours
Intra-Domain	VGGFace2 [47]	Gender	76.7289 (0.4985)	78.0764 (0.4266)	83.8978 (0.3697)	97.0133 (0.1293)	97.0863 (0.3414)	99.7333 (0.1265)	74.2400 (0.8182)	75.356 (0.5938)	87.5333 (0.5007)	58.9333 (0.5481)	61.787 (0.3445)	64.4267 (0.4818)
Intra-Domain	VGGFace2 [47]	Age	54.4311 (0.7443)	58.4889 (0.7341)	77.4044 (0.6714)	68.000 (0.5530)	73.0067 (0.6534)	98.4133 (0.1543)	49.8000 (0.6613)	53.2467 (0.8465)	78.9733 (1.0771)	45.4933 (1.0186)	49.2134 (0.7025)	54.8267 (0.7827)
Cross-Domain	A-FF++ [19]	Gender	84.9733 (0.4651)	89.1714 (0.1974)	91.3000 (0.2058)	96.8267 (0.3832)	98.0528 (0.1483)	98.9333 (0.0943)	88.0933 (0.4668)	94.3074 (0.2586)	98.8667 (0.1033)	70.0000 (0.5452)	75.1539 (0.1854)	76.1000 (0.4197)
Cross-Domain	A-FF++ [19]	Age	59.4867 (0.9291)	64.1893 (0.7609)	75.5393 (0.5130)	71.254 (0.5973)	80.5980 (0.4140)	93.1980 (0.3110)	58.1720 (0.4489)	63.8340 (0.6733)	81.1960 (0.4702)	49.0340 (1.7410)	48.1360 (1.1954)	52.2240 (0.7577)
Cross-Domain	A-DFDC [19]	Gender	70.1111 (0.5037)	76.0917 (0.6290)	78.2922 (0.5178)	85.8533 (0.5239)	92.1414 (0.5447)	96.2933 (0.5927)	68.4666 (0.5667)	73.5088 (0.4910)	76.2005 (0.5028)	56.0133 (0.5014)	62.6249 (0.8513)	62.3334 (0.4580)
Cross-Domain	A-DFDC [19]	Age	66.6967 (0.8015)	69.5907 (0.5687)	77.1800 (0.6300)	72.1580 (0.6785)	84.5000 (0.4908)	95.3820 (0.3247)	64.398 (0.9182)	64.238 (0.5423)	73.1620 (0.8592)	63.5340 (0.8078)	60.034 (0.6730)	62.9960 (0.7061)
Cross-Domain	A-DFD [19]	Gender	66.7156 (0.7681)	70.7983 (1.2229)	74.9822 (0.6029)	85.5467 (0.6791)	88.9791 (0.8297)	94.0400 (0.2999)	60.5467 (0.9017)	64.2144 (1.3436)	70.0267 (0.9471)	54.0533 (0.7235)	59.2015 (1.4953)	60.8800 (0.5616)
Cross-Domain	A-Celeb-DF-v2 [19]	Gender	91.9244 (0.3003)	90.8100 (0.4487)	95.1489 (0.4088)	98.9733 (0.1769)	98.1867 (0.2286)	99.9867 (0.0267)	94.0000 (0.3239)	94.4933 (0.5052)	99.7600 (0.0998)	82.8000 (0.4000)	79.7500 (0.6124)	85.7000 (1.1000)

Table 3: Comparing our annotator against Face++ [42] and InsightFace [43] under intra-domain and cross-domain evaluations (Accuracy (%)) on different levels of sample difficulty. The prediction mean and standard deviation (in parentheses) are reported. The best results are shown in Bold. More results are in Appendix A.1.3.

For Q2, Setting: We also design a stratified evaluation method by separating each test dataset into three subsets (Easy, Medium, and Hard) based on the estimated uncertainty scores. Specifically, for each demographic attribute $a$, we define two thresholds $t_{1}^{a}$ and $t_{2}^{a}$, where $t_{1}^{a}<t_{2}^{a}$ (more details are in Appendix A.1.2). Then, we have $Easy=\{X\mid V(X^{(a)})<t_{1}^{a}\}$, $Medium=\{X\mid t_{1}^{a}\leq V(X^{(a)})\leq t_{2}^{a}\}$, and $Hard=\{X\mid V(X^{(a)})>t_{2}^{a}\}$. Next, we sample 1,500 images from each subset. This stratification is crucial for a thorough examination of the model’s performance across a broad spectrum of data challenges. To avoid attribute-specific biases, each subset is balanced with respect to the attribute. Results: Table 3 illustrates that while all methods show decreased accuracy as the sample difficulty level increases, our annotator demonstrates greater resilience. For example, under intra-domain evaluation, our annotator’s gender accuracy drops by 12.2% from easy to medium difficulty, compared to Face++’s 21.7% drop. In the cross-domain scenario, our annotator experiences a 14.3% reduction on gender in A-Celeb-DF-v2 [19] from easy to hard, versus InsightFace’s 16.2%.
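The stratification rule amounts to a two-threshold lookup on the uncertainty score. The following sketch is illustrative only: the function name `difficulty_level` and the threshold values 0.3 and 0.6 are hypothetical, since the actual per-attribute thresholds are specified in Appendix A.1.2.

```python
def difficulty_level(v, t1, t2):
    """Map an uncertainty score v to Easy/Medium/Hard given thresholds t1 < t2."""
    if not t1 < t2:
        raise ValueError("thresholds must satisfy t1 < t2")
    if v < t1:
        return "Easy"
    if v <= t2:
        return "Medium"  # the closed interval [t1, t2] belongs to Medium
    return "Hard"

# Hypothetical thresholds and scores for one attribute.
scores = [0.12, 0.30, 0.45, 0.60, 0.83]
levels = [difficulty_level(v, t1=0.3, t2=0.6) for v in scores]
```

Both boundary values fall into the Medium subset, matching the non-strict inequalities in the definition of $Medium$.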

3.2 Phase 2: Demographic Annotation Generation

Data Collection. We build our AI-Face dataset by collecting and integrating public AI-generated face images sourced from academic publications, GitHub repositories, and commercial tools. More details are in Appendix A.2.1. Specifically, the fake face images in our dataset originate from 4 Deepfake Video datasets (i.e., A-FF++ [19], A-DFDC [19], A-DFD [19], and A-Celeb-DF-v2 [19]), 10 GAN models (i.e., AttGAN [49], MMDGAN [50], StarGAN [49], StyleGANs [49, 51, 52], MSGGAN [50], ProGAN [53], STGAN [50], and VQGAN [54]), and 8 DM models (i.e., DALLE2 [55], IF [55], Midjourney [55], DCFace [56], Latent Diffusion [57], Palette [58], Stable Diffusion v1.5 [59], and Stable Diffusion Inpainting [59]). This constitutes a total of 1,245,660 fake face images in our dataset. These fake images are correspondingly generated from 8 real source datasets (i.e., FFHQ [6], CASIA-WebFace [37], IMDB-WIKI [38], CelebA [36], and real images from FF++ [2], DFDC [39], DFD [40], and Celeb-DF-v2 [41]). This constitutes a total of 866,096 real face images in our dataset. In general, our dataset contains 30 subsets and 37 generation methods (i.e., 5 in A-FF++, 5 in A-DFD, 8 in A-DFDC, 1 in A-Celeb-DF-v2, 10 GANs, and 8 DMs). We use RetinaFace [60] to detect and crop faces so that each image contains only one face.
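The dataset counts above can be cross-checked with a few lines of arithmetic. The numbers are taken from the paragraph; the decomposition of the 30 subsets into 4 deepfake-video, 10 GAN, 8 DM, and 8 real subsets is an assumption that happens to match the stated total.

```python
# Generation methods per source, as listed in the text.
methods = {
    "A-FF++": 5, "A-DFD": 5, "A-DFDC": 8, "A-Celeb-DF-v2": 1,
    "GANs": 10, "DMs": 8,
}
total_methods = sum(methods.values())

# Assumed subset breakdown: deepfake-video sets + GANs + DMs + real sources.
total_subsets = 4 + 10 + 8 + 8

fake_images, real_images = 1_245_660, 866_096
total_images = fake_images + real_images
fake_share = fake_images / total_images

print(total_methods, total_subsets, total_images, round(fake_share, 3))
```

Under these counts, fake faces make up roughly 59% of the dataset, so a detector's accuracy should be read against that class balance.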

Annotator Prediction. For our collected images, annotations are generated iteratively: the annotator from Phase 1 predicts each demographic attribute together with an uncertainty score, as shown in Fig. 2.

Human Correction. As described in ‘Uncertainty Estimation’ in Section 3.1, the annotator may mispredict ambiguous face images, necessitating human review and correction. To this end, we propose two annotation correction strategies: 1) For subsets that have the same images and demographic attribute classes as those in existing datasets, such as A-FF++ [19] and A-DFDC [19], we filter out images that may need human correction based on annotation inconsistency.

2) For the rest of the subsets, we identify the most ambiguous images that need human correction based on uncertainty scores. Specifically, for demographic attribute $a_{j}$ on subset $j$, we define a specific threshold $t^{a_{j}}$ (more details are in Appendix A.2.2). If $V(X^{(a_{j})})>t^{a_{j}}$, the annotation for attribute $a_{j}$ of the image $X$ undergoes a verification process, potentially requiring human re-annotation (see Fig. 3). In practice, we recruit three human annotators to correct the filtered images and consolidate their evaluations with a majority vote to finalize annotations.
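The uncertainty-based correction strategy can be sketched as a threshold filter followed by a majority vote. This is an illustrative sketch, not the authors' pipeline: the helper names, the threshold value, and the choice to return `None` when all three annotators disagree are assumptions, since the text does not say how a three-way split is resolved.

```python
from collections import Counter

def needs_review(uncertainty, threshold):
    """Flag an annotation for human verification when V(X) exceeds the threshold."""
    return uncertainty > threshold

def majority_vote(labels):
    """Return the label chosen by a strict majority, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Hypothetical example: one race annotation with uncertainty 0.72, threshold 0.6.
predicted = "White"
if needs_review(0.72, threshold=0.6):
    final = majority_vote(["Asian", "Asian", "White"]) or predicted
else:
    final = predicted
```

Falling back to the annotator's prediction when no majority exists is one possible policy; escalating such images to an additional reviewer would be another.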

Evaluation. To estimate our dataset’s quality, we answer the following questions: Q1: Can we directly incorporate the existing annotations into our dataset? Q2: How effective is human correction? Q3: What is the overall annotation quality of our dataset?

		Gender			Age			Race		
Type	Datasets	ACC(%)	Precision(%)	Recall(%)	ACC(%)	Precision(%)	Recall(%)	ACC(%)	Precision(%)	Recall(%)
For Q1	A-FF++ [19]	8.0163	17.3354	5.8314	19.9002	30.6658	29.6071	28.7865	35.7122	41.1687
For Q1	Ours-FF++ (w/o Correction)	91.9837	82.6646	94.1684	21.1830	32.1232	45.7231	45.9775	50.3803	40.1949
For Q1	A-DFDC [19]	20.2252	27.5332	21.6538	16.7493	29.0640	29.5519	18.1115	15.1092	22.0637
For Q1	Ours-DFDC (w/o Correction)	79.7748	72.4668	78.3462	45.9748	49.4734	48.7861	70.9001	64.7655	65.1608
For Q2	Ours (w/o Correction)	83.4167	83.4167	83.4242	43.8333	43.8333	54.1792	67.4167	65.0718	59.2350
For Q2	Ours	84.8333	84.8738	84.8599	44.7500	44.0937	54.6033	68.8333	66.6440	61.3225
For Q3	Ours	98.6667	98.6688	98.6667	56.2500	50.1748	53.0514	86.2500	75.5216	67.4076

Table 4: Evaluation results of our dataset annotation quality for questions Q1, Q2, and Q3. ‘Ours-FF++ (w/o Correction),’ ‘Ours-DFDC (w/o Correction),’ and ‘Ours (w/o Correction)’ represent our predicted annotations on A-FF++, A-DFDC, and our entire dataset without human correction, respectively. ACC represents Accuracy.

For Q1, Setting: We compare our dataset’s annotation quality before human correction on A-FF++ (i.e., Ours-FF++ (w/o Correction)) and A-DFDC (i.e., Ours-DFDC (w/o Correction)) against their existing annotations from [19]. We regard human re-labeled annotations as the ground truth. Results: The results in Table 4 ‘For Q1’ show the superior annotation accuracy of our datasets. For example, Ours-FF++ (w/o Correction) surpasses A-FF++ by 83.97% in gender accuracy, and Ours-DFDC (w/o Correction) exceeds A-DFDC by 59.55%. This large performance gap indicates that the images identified by annotation inconsistency are mislabeled in A-FF++ [19] and A-DFDC [19], so their annotations cannot be directly merged into our dataset. Some examples are shown in Appendix A.2.3.

For Q2, Setting: We consider two dataset versions: 1) Ours (w/o Correction), where annotations are not corrected by humans. 2) Ours, where annotations are corrected by humans. With the help of the uncertainty score, we sample 1,200 attribute-balanced images (400 easy, 400 medium, and 400 hard) from the whole dataset to ensure a fair evaluation. Three humans re-annotated these images to establish ground truth. Results: Table 4 ‘For Q2’ shows that human corrections improve performance across all attributes, increasing accuracy by 1.42% for gender, 0.92% for age, and 1.42% for race, validating the effectiveness of our correction strategy. See Appendix A.2.4 for more results.

For Q3, Setting: We randomly sample 1,200 images from the whole dataset. Three humans also re-annotated these images to create ground truth. Results: As shown in Table 4 ‘For Q3’, Ours reflects the approximate overall annotation quality of our dataset. Notably, the annotations of the gender and race attributes show high correctness (e.g., 98.6667% ACC on gender and 86.2500% ACC on race). However, the age annotation shows lower accuracy, since age groups are challenging to differentiate.

4 Fairness Benchmark Experiments

In this section, we estimate the existing AI-generated image detectors’ fairness performance alongside their utility on our AI-Face Dataset (80%/20% for Train/Test). Our goal is to show the significance of our dataset and expose the fairness issues of recent detectors in combating AI-generated faces.

Detection Methods. Our benchmark implements 12 detectors, as detailed in Appendix B.1. The methods span detectors tailored to AI-generated faces from Deepfake Videos, GANs, and DMs. They can be classified into four types: Naive detectors: backbone models that can be directly utilized as the detector for binary classification, including CNN-based (i.e., Xception [61] and EfficientB4 [62]) and transformer-based (i.e., ViT-B/16 [63]) models. Frequency-based: explore the frequency domain for forgery detection (i.e., F3Net [64], SPSL [65], and SRM [66]). Spatial-based: focus on mining spatial characteristics (e.g., texture) within images for detection (i.e., UCF [16], UnivFD [67], and CORE [68]). Fairness-enhanced: focus on improving fairness in AI-generated face detection by designing specific algorithms (i.e., DAW-FDD [20], DAG-FDD [20], and PG-FDD [21]). Implementation and training details are in Appendix B.2.

Model Type			Naive			Frequency			Spatial			Fairness-enhanced		
Measure	Attribute	Metric	Xception [61]	EfficientB4 [62]	ViT-B/16 [63]	F3Net [64]	SPSL [65]	SRM [66]	UCF [16]	UnivFD [67]	CORE [68]	DAW-FDD [20]	DAG-FDD [20]	PG-FDD [21]
Fairness↓	Gender	$F_{MEO}$ (%)	0.387	1.176	0.187	0.279	0.454	0.533	0.305	0.458	1.635	0.404	0.272	0.236
Fairness↓	Gender	$F_{DP}$ (%)	2.843	2.052	2.489	2.941	2.998	2.433	2.890	2.456	1.977	2.979	2.799	2.614
Fairness↓	Gender	$F_{OAE}$ (%)	0.271	0.595	0.422	0.086	0.188	0.268	0.169	0.557	0.977	0.123	0.192	0.134
Fairness↓	Gender	$F_{EO}$ (%)	0.439	1.229	0.235	0.552	0.577	0.536	0.346	0.490	1.846	0.699	0.407	0.237
Fairness↓	Race	$F_{MEO}$ (%)	4.386	8.307	13.078	3.098	4.736	5.470	3.188	14.663	16.001	3.461	3.344	1.956
Fairness↓	Race	$F_{DP}$ (%)	18.248	19.691	18.446	18.282	18.822	16.182	18.770	23.542	24.163	18.306	18.288	18.040
Fairness↓	Race	$F_{OAE}$ (%)	3.509	5.659	5.351	2.217	2.201	4.044	1.847	6.505	5.105	1.365	1.847	1.132
Fairness↓	Race	$F_{EO}$ (%)	10.863	19.921	24.002	7.052	7.282	11.602	6.311	30.947	24.015	6.948	6.439	4.039
Fairness↓	Age	$F_{MEO}$ (%)	1.695	3.028	8.931	1.319	1.025	1.090	0.854	5.818	6.964	2.838	0.809	0.781
Fairness↓	Age	$F_{DP}$ (%)	6.242	6.724	6.264	6.357	6.340	5.905	6.257	6.260	5.030	6.249	6.140	6.098
Fairness↓	Age	$F_{OAE}$ (%)	1.028	1.619	3.948	1.017	0.710	0.934	0.635	4.966	3.652	2.610	0.606	0.506
Fairness↓	Age	$F_{EO}$ (%)	4.116	6.080	12.888	3.696	2.827	3.116	2.479	15.252	8.382	7.361	2.171	1.587
Fairness↓	Intersection	$F_{MEO}$ (%)	7.113	9.999	14.667	4.739	7.320	9.731	4.606	17.606	19.303	5.316	4.708	2.604
Fairness↓	Intersection	$F_{DP}$ (%)	20.675	20.963	20.114	20.492	20.242	19.112	20.704	24.366	25.892	20.373	19.940	20.402
Fairness↓	Intersection	$F_{OAE}$ (%)	6.174	9.181	7.711	3.692	3.744	6.498	3.061	11.802	6.035	2.641	3.174	1.830
Fairness↓	Intersection	$F_{EO}$ (%)	24.520	42.330	49.075	16.699	16.257	25.983	13.932	68.449	47.016	14.539	14.118	8.618
Fairness↓	Individual	$F_{IND}$ (%)	112.067	585.935	0.125	46.083	22.982	1.383	3.246	8.606	0.598	28.437	13.706	0.477
Fairness↓		Avg-$F_{R}$	6.824	9.529	7.706	5.941	6.647	5.471	4.353	9.941	9.235	6.118	3.765	2.059
Fairness↓		Avg-$F_{MR}$	8.020			6.020			7.843			3.981		
Utility↑		ACC (%)	97.639	95.404	93.719	98.229	98.274	97.978	98.635	90.229	96.087	97.316	98.543	99.079
Utility↑		AUC (%)	99.768	99.117	98.914	99.826	99.786	99.767	99.885	96.030	98.846	99.703	99.871	99.937
Utility↑		AP (%)	99.846	99.359	99.240	99.885	99.853	99.829	99.917	96.973	98.987	99.802	99.916	99.956
Utility↑		EER (%)	2.388	4.794	5.829	1.741	1.610	2.134	1.365	10.680	4.656	2.701	1.365	1.212
		Training Time / Epoch	1h35min	3h07min	3h26min	1h41min	1h37min	4h05min	5h10min	5h07min	1h36min	1h45min	1h38min	7h45min

Table 5: Overall performance. Top 3 values on each metric are highlighted in green, blue, and yellow.

Evaluation Metrics. To provide a comprehensive benchmark, we consider 5 fairness metrics commonly used in the fairness community [69, 70, 71, 72, 73] and 4 widely used utility metrics. For fairness metrics, we consider Demographic Parity ($F_{DP}$) [69, 70], Max Equalized Odds ($F_{MEO}$) [72], Equal Odds ($F_{EO}$) [71], and Overall Accuracy Equality ($F_{OAE}$) [72] for evaluating group (e.g., gender) and intersectional (e.g., individuals of a specific race and simultaneously a specific gender) fairness. We also use individual fairness ($F_{IND}$) [73, 74] (i.e., similar individuals should have similar predicted outcomes). The definitions of the fairness metrics can be found in Appendix B.3. To compare detectors’ performance clearly and fairly, we define the Average Fairness Rank (Avg-$F_{R}$), which ranks each detector on each fairness metric and averages these ranks. We also define Avg-$F_{MR}$ as the average rank across methods within a model type. For utility metrics, we employ Accuracy (ACC), the Area Under the ROC Curve (AUC), Average Precision (AP), and Equal Error Rate (EER).
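The two rank-based summaries can be sketched as follows. This is a minimal illustration rather than the benchmark code: the function name is hypothetical, ties are broken by input order (an assumption), and the ranking example uses only three detectors and two gender metrics from Table 5, so its ranks do not reproduce the table's Avg-$F_{R}$ values. The Avg-$F_{MR}$ computation, in contrast, uses the Avg-$F_{R}$ values reported for the three Naive detectors.

```python
def average_fairness_rank(scores):
    """scores: {metric: {detector: value}}, where lower values are fairer."""
    detectors = list(next(iter(scores.values())))
    ranks = {d: [] for d in detectors}
    for per_detector in scores.values():
        # Rank 1 goes to the smallest (fairest) value on this metric.
        ordered = sorted(detectors, key=lambda d: per_detector[d])
        for rank, d in enumerate(ordered, start=1):
            ranks[d].append(rank)
    return {d: sum(r) / len(r) for d, r in ranks.items()}

# Illustrative subset of Table 5 (gender F_MEO and F_DP only).
subset = {
    "F_MEO": {"Xception": 0.387, "UCF": 0.305, "PG-FDD": 0.236},
    "F_DP": {"Xception": 2.843, "UCF": 2.890, "PG-FDD": 2.614},
}
avg_rank = average_fairness_rank(subset)

# Avg-F_MR as the mean Avg-F_R of the detectors in one model type (Naive).
naive_avg_fr = [6.824, 9.529, 7.706]
avg_fmr_naive = round(sum(naive_avg_fr) / len(naive_avg_fr), 3)
```

Averaging the three Naive Avg-$F_{R}$ values gives 8.02, consistent with the Avg-$F_{MR}$ entry reported for that model type in Table 5.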

Results. Overall Performance. Table 5 reports the overall performance on our AI-Face test set. Our observations are: 1) Most detectors exhibit clear fairness gaps; only the Fairness-enhanced detectors demonstrate relatively low performance disparities. 2) The top 3 performing methods are PG-FDD [21], DAG-FDD [20], and UCF [16] according to Avg-$F_{R}$. 3) According to Avg-$F_{MR}$, Fairness-enhanced detectors demonstrate superior performance, and Frequency detectors surpass both Spatial and Naive detectors. A possible reason is that frequency features focus more on forgery traces while weakening demographic features. This highlights a potential avenue for future research: enhancing detector fairness by integrating frequency features with fairness-enhanced algorithms. 4) 9 out of 12 detectors have an AUC higher than 99%, demonstrating that our AI-Face dataset supports training AI face detectors with high utility. 5) PG-FDD demonstrates superior performance but has a long training time, which can be explored and addressed in the future.

Performance on Different Subsets. Fig. 4 shows the intersectional $F_{EO}$ and AUC performance of detectors on each test subset (e.g., subsets originating from different generative methods). We observe that fairness performance varies considerably across generative methods for every detector. The largest bias for most detectors comes from detecting face images generated by STGAN [75] and Commercial Tools (CT), including DALLE2 [55], IF [55], and Midjourney [55]. Moreover, the stable utility demonstrates our dataset’s expansiveness and diversity, which enable effective training to detect AI-generated faces from various generative methods. Full evaluation results are in Appendix B.4.

Performance on Different Subgroups. We analyze all detectors on eight intersectional subgroups: Male-White (M-W), Male-Black (M-B), Male-Asian (M-A), Male-Others (M-O), Female-White (F-W), Female-Black (F-B), Female-Asian (F-A), and Female-Others (F-O). Fig. 5 plots the ratio of each subgroup’s FPR to that of a reference group (M-W). 1) Facial images of M-A, F-B, and F-A are more likely to be mistakenly detected as fake than facial images of M-W. 2) However, the FPR of M-W is higher than that of the other subgroups in DAW-FDD. This highlights a challenge in algorithmic fairness methods: improving performance for minority groups can inadvertently raise the error rate for the majority group (e.g., M-W). See demographic distribution in Appendix A.2.1.
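The FPR-ratio analysis above can be sketched as follows. This is a minimal illustration, not the benchmark code: it assumes the label convention 0 = real and 1 = fake, so that a false positive is a real face flagged as fake, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

def fpr_ratios(y_true, y_pred, subgroups, reference="M-W"):
    """FPR of each subgroup divided by the FPR of the reference subgroup.

    Assumed label convention: 0 = real, 1 = fake.
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, subgroups):
        if t == 0:  # only real faces contribute to the FPR
            negatives[g] += 1
            false_pos[g] += int(p == 1)
    fpr = {g: false_pos[g] / negatives[g] for g in negatives}
    if fpr[reference] == 0:
        raise ValueError("reference FPR is zero; the ratio is undefined")
    return {g: fpr[g] / fpr[reference] for g in fpr}

# Toy example (hypothetical data): four real faces per subgroup.
y_true = [0] * 8
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
subgroups = ["M-W"] * 4 + ["F-B"] * 4

ratios = fpr_ratios(y_true, y_pred, subgroups)
```

A ratio above 1 means that real faces from that subgroup are flagged as fake more often than real faces from the reference group.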

Fairness Robustness Evaluation. Images spread on public platforms usually undergo post-processing. Therefore, it is important to assess whether detectors preserve fairness when handling distorted images. We apply 6 post-processing methods: Random Crop (RC) [76], Rotation (RT) [25], Brightness Contrast (BC) [25], Hue Saturation Value (HSV) [25], Gaussian Blur (GB) [25], and JPEG Compression (JC) [77] to the test images (see Appendix B.5 for more details). Fig. 6 shows each detector’s intersectional $F_{EO}$ and AUC performance changes after post-processing. Our observations are: 1) These impairments tend to wash out forensic traces, to the point that detectors suffer significant performance degradation. 2) Recent Fairness-enhanced detectors struggle to maintain fairness when images undergo post-processing. 3) Transformer-based models (i.e., ViT-B/16 [63] and UnivFD [67]) demonstrate stronger robustness than CNN-based models. 4) JPEG Compression and Gaussian Blur cause notably greater performance degradation than the other methods. See Appendix B.6 for more robustness analysis with respect to different degrees of post-processing.

		Dataset
		A-DF-1.0 [19]			DF-Platter [35]			GenData [22]
		Fairness(%)↓		Utility(%)↑	Fairness(%)↓		Utility(%)↑	Fairness(%)↓		Utility(%)↑
Model Type	Detector	$F_{OAE}$	$F_{EO}$	AUC	$F_{OAE}$	$F_{EO}$	AUC	$F_{OAE}$	$F_{EO}$	AUC	Avg- $F_{R}$
Naive	Xception [61]	4.227(+3.956)	9.198(+8.759)	82.479(-17.289)	2.308(+2.037)	8.691(+8.252)	75.933(-23.835)	0.438(+0.167)	1.724(+1.285)	94.315(-5.453)	5.167
Naive	EfficientB4 [62]	3.689(+3.094)	17.017(+15.788)	61.436(-37.681)	4.459(+3.864)	10.191(+8.962)	63.871(-35.246)	0.001(-0.594)	3.621(+2.392)	87.522(-11.595)	8.000
Naive	ViT-B/16 [63]	4.45(+4.028)	9.154(+8.919)	70.896(-28.018)	2.531(+2.109)	5.557(+5.322)	68.935(-29.979)	1.249(+0.827)	2.874(+2.639)	89.109(-9.805)	6.667
Frequency	F3Net [64]	1.749(+1.663)	19.484(+18.932)	86.265(-13.561)	2.995(+2.909)	5.445(+4.893)	82.421(-17.405)	0.155(+0.069)	2.927(+2.375)	93.882(-5.944)	6.000
Frequency	SPSL [65]	8.497(+8.309)	2.430(+1.853)	75.177(-24.609)	3.323(+3.135)	8.966(+8.389)	82.024(-17.762)	0.138(-0.050)	2.321(+1.744)	94.320(-5.466)	6.167
Frequency	SRM [66]	3.708(+3.440)	1.169(+0.633)	65.779(-33.988)	4.976(+4.708)	33.702(+33.166)	72.777(-26.990)	1.545(+1.277)	2.378(+1.842)	94.130(-5.637)	8.000
Spatial	UCF [16]	2.930(+2.761)	9.924(+9.578)	83.260(-16.625)	3.536(+3.367)	9.395(+9.049)	83.92(-15.965)	1.346(+1.177)	1.377(+1.031)	94.948(-4.937)	6.500
Spatial	UnivFD [67]	14.149(+13.592)	1.833(+1.343)	65.810(-30.220)	7.686(+7.129)	11.701(+11.211)	69.483(-26.547)	0.903(+0.346)	2.227(+1.737)	85.965(-10.065)	8.167
Spatial	CORE [68]	0.308(-0.669)	11.854(+10.008)	79.222(-19.624)	3.966(+2.989)	5.267(+3.421)	81.264(-17.582)	0.005(-0.972)	2.943(+1.097)	94.329(-4.517)	5.667
Fairness-enhanced	DAW-FDD [20]	5.040(+4.917)	4.993(+4.294)	80.308(-19.395)	2.577(+2.454)	7.253(+6.554)	78.562(-21.141)	0.205(+0.082)	2.708(+2.009)	93.876(-5.827)	6.000
Fairness-enhanced	DAG-FDD [20]	4.279(+4.087)	13.565(+13.158)	85.859(-14.012)	3.885(+3.693)	7.350(+6.943)	83.153(-16.718)	1.062(+0.870)	1.688(+1.281)	94.326(-5.545)	7.167
Fairness-enhanced	PG-FDD [21]	4.263(+4.129)	11.077(+10.840)	81.174(-18.763)	1.984(+1.850)	4.715(+4.478)	84.572(-15.365)	1.205(+1.071)	1.159(+0.922)	94.962(-4.975)	4.500

Table 6: Fairness generalization results based on the gender attribute. The smallest performance changes (in parentheses) and the best performance are in bold and in red, respectively.

Fairness Generalization Evaluation. To evaluate detectors’ fairness generalization capability, we train them on AI-Face and test them on A-DF-1.0, DF-Platter, and GenData, none of which are part of AI-Face. Results on the gender attribute in Table 6 show that: 1) According to Avg- $F_{R}$ , the top three methods in fairness preservation are PG-FDD, Xception, and CORE. PG-FDD, specifically designed for fairness generalization, leads in overall performance. However, it does not excel in terms of performance changes relative to the intra-domain test results in Table 5, indicating room for improvement in its generalization capabilities. 2) CORE is notable for its negative $F_{OAE}$ changes on A-DF-1.0 and GenData (i.e., its $F_{OAE}$ is lower than in the intra-domain test), suggesting that techniques within CORE could be explored to enhance fairness generalization. More results are in Appendix B.7.
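As an illustration of how Avg- $F_{R}$ summarizes Table 6, the sketch below ranks each detector on each of the six fairness columns (lower is better) and averages the ranks. Two points are assumptions rather than statements from the text: that the Avg- $F_{R}$ column of Table 6 is computed over exactly these six columns, and that ties, which do not occur here, need no special handling. Under these assumptions the computed values agree with the last column of Table 6. The function name is hypothetical.

```python
# Fairness columns of Table 6, in order:
# A-DF-1.0 F_OAE, A-DF-1.0 F_EO, DF-Platter F_OAE, DF-Platter F_EO,
# GenData F_OAE, GenData F_EO (all in %, lower is better).
table6 = {
    "Xception":    [4.227, 9.198, 2.308, 8.691, 0.438, 1.724],
    "EfficientB4": [3.689, 17.017, 4.459, 10.191, 0.001, 3.621],
    "ViT-B/16":    [4.45, 9.154, 2.531, 5.557, 1.249, 2.874],
    "F3Net":       [1.749, 19.484, 2.995, 5.445, 0.155, 2.927],
    "SPSL":        [8.497, 2.430, 3.323, 8.966, 0.138, 2.321],
    "SRM":         [3.708, 1.169, 4.976, 33.702, 1.545, 2.378],
    "UCF":         [2.930, 9.924, 3.536, 9.395, 1.346, 1.377],
    "UnivFD":      [14.149, 1.833, 7.686, 11.701, 0.903, 2.227],
    "CORE":        [0.308, 11.854, 3.966, 5.267, 0.005, 2.943],
    "DAW-FDD":     [5.040, 4.993, 2.577, 7.253, 0.205, 2.708],
    "DAG-FDD":     [4.279, 13.565, 3.885, 7.350, 1.062, 1.688],
    "PG-FDD":      [4.263, 11.077, 1.984, 4.715, 1.205, 1.159],
}

def average_fairness_rank(scores):
    """Rank detectors per metric (1 = lowest value) and average the ranks."""
    names = list(scores)
    n_metrics = len(next(iter(scores.values())))
    rank_sum = {name: 0 for name in names}
    for m in range(n_metrics):
        ordered = sorted(names, key=lambda name: scores[name][m])
        for rank, name in enumerate(ordered, start=1):
            rank_sum[name] += rank
    return {name: rank_sum[name] / n_metrics for name in names}

avg_rank = average_fairness_rank(table6)
top_three = sorted(avg_rank, key=avg_rank.get)[:3]
```

Because the score is an average of ranks, it reflects relative ordering only: a detector can place well without having small absolute fairness gaps.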

Effect of Increasing Training Set Size. We randomly sample 20%, 40%, 60%, and 80% of each training subset from AI-Face to assess the impact of training size on performance. Key observations from Fig. 7: 1) The performance of UnivFD changes the least and does not improve as the training set grows. 2) Overall, detectors’ performance improves with larger training size, though a few show fluctuations (e.g., ViT-B/16 and CORE). 3) A larger training set may improve utility but not always fairness. For example, Xception and SRM show increased utility when training size grows from 60% to 80%, but fairness worsens. Similar trends are observed in DAG-FDD and SPSL when the training set size increases from 40% to 60%. See Appendix B.8 for full results.
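The per-subset sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the subset names and file names are hypothetical, sampling is without replacement, and each fraction is drawn independently (the text does not say whether the smaller samples are nested inside the larger ones).

```python
import random

def sample_training_subsets(subsets, fraction, seed=0):
    """Randomly keep `fraction` of the items in every training subset."""
    rng = random.Random(seed)
    sampled = {}
    for name, items in subsets.items():
        k = round(len(items) * fraction)
        sampled[name] = rng.sample(items, k)  # sampling without replacement
    return sampled

# Hypothetical subsets standing in for the per-generator training splits.
subsets = {
    "gan": [f"gan_{i}.png" for i in range(50)],
    "dm": [f"dm_{i}.png" for i in range(20)],
}

sizes = {
    fraction: {name: len(items)
               for name, items in sample_training_subsets(subsets, fraction).items()}
    for fraction in (0.2, 0.4, 0.6, 0.8)
}
```

Sampling within each subset, rather than from the pooled training set, keeps the mix of generative methods unchanged as the training size varies.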

Discussion. Based on the above experiments, we summarize the unsolved fairness problems in recent detectors: 1) Detectors’ fairness is unstable when detecting face images generated by different generative methods, indicating a future direction for enhancing fairness stability since new generative models continue to emerge. 2) Even though fairness-enhanced detectors exhibit small overall fairness metrics, they still show biased detection towards minority groups. Future studies should be more cautious when designing fair detectors to ensure balanced performance across all demographic groups. 3) There is currently no reliable detector, as all detectors experience severe performance degradation under image post-processing and cross-domain evaluation. Future studies should aim to develop a unified framework that ensures fairness, robustness, and generalization, as these three characteristics are essential for creating a reliable detector.

5 Conclusion

This work presents the first demographically annotated million-scale AI-Face dataset, serving as a pivotal foundation for addressing the urgent need for developing fair AI face detectors. Based on our AI-Face dataset, we conduct the first comprehensive fairness benchmark, shedding light on the fairness performance and challenges of current representative AI face detectors. Our findings can inspire and guide researchers in refining current models and exploring new methods to mitigate bias. Limitation and Future Work: One limitation is that age annotations in our AI-Face dataset have relatively lower accuracy as the age attribute is often too ambiguous to predict. We will improve our annotator’s accuracy in predicting age attributes in the future. Additionally, we plan to extend our fairness benchmark to evaluate large language models like LLaMA2 [78] and GPT4 [79] for detecting AI faces. Social Impact: Malicious users could misuse AI-generated face images from our dataset to create fake social media profiles and spread misinformation. To mitigate this risk, only users who submit a signed end-user license agreement (EULA) will be granted access to our dataset.

Acknowledgment

This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2348419 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of NSF and NAIRR Pilot.

References

[1] Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, and Shu Hu. Detecting multimedia generated by large ai models: A survey. arXiv preprint arXiv:2402.00045, 2024.
[2] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019.
[3] Deepfakes github. https://github.com/deepfakes/faceswap. Accessed: 2024-04-17.
[4] Fakeapp. https://www.fakeapp.com/. Accessed: 2024-04-17.
[5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR 2019, 2019.
[6] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
[7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[8] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in neural information processing systems, 34:852–863, 2021.
[9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[10] Daniel J Tojin T. Eapen. How generative ai can augment human creativity. https://hbr.org/2023/07/how-generative-ai-can-augment-human-creativity, 2023. Accessed: 2024-04-21.
[11] BBC News. Trump supporters target black voters with faked ai images. https://www.bbc.com/news/world-us-canada-68440150, 2024. Accessed: 2023-05-09.
[12] Henrik Skaug Sætra. Generative ai: Here to stay, but for good? Technology in Society, 75:102372, 2023.
[13] Mika Westerlund. The emergence of deepfake technology: A review. Technology innovation management review, 9(11), 2019.
[14] Wenbo Pu, Jing Hu, Xin Wang, Yuezun Li, Shu Hu, Bin Zhu, Rui Song, Qi Song, Xi Wu, and Siwei Lyu. Learning a deep dual-level network for robust deepfake detection. Pattern Recognition, 130:108832, 2022.
[15] Hui Guo, Shu Hu, Xin Wang, Ming-Ching Chang, and Siwei Lyu. Robust attentive deep neural network for detecting gan-generated faces. IEEE Access, 10:32574–32583, 2022.
[16] Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Ucf: Uncovering common features for generalizable deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22412–22423, 2023.
[17] Lorenzo Papa, Lorenzo Faiella, Luca Corvitto, Luca Maiano, and Irene Amerini. On the use of stable diffusion for creating realistic faces: from generation to detection. In 2023 11th International Workshop on Biometrics and Forensics (IWBF), pages 1–6. IEEE, 2023.
[18] Loc Trinh and Yan Liu. An examination of fairness of ai models for deepfake detection. IJCAI, 2021.
[19] Ying Xu, Philipp Terhöst, Marius Pedersen, and Kiran Raja. Analyzing fairness in deepfake detection with massively annotated databases. IEEE Transactions on Technology and Society, 2024.
[20] Yan Ju, Shu Hu, Shan Jia, George H Chen, and Siwei Lyu. Improving fairness in deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4655–4665, 2024.
[21] Li Lin, Xinan He, Yan Ju, Xin Wang, Feng Ding, and Shu Hu. Preserving fairness generalization in deepfake detection. CVPR, 2024.
[22] Christopher Teo, Milad Abdollahzadeh, and Ngai-Man Man Cheung. On measuring fairness in generative models. Advances in Neural Information Processing Systems, 36, 2024.
[23] Chuqiao Li, Zhiwu Huang, Danda Pani Paudel, Yabin Wang, Mohamad Shahbazi, Xiaopeng Hong, and Luc Van Gool. A continual deepfake detection benchmark: Dataset, methods, and essentials. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1339–1349, 2023.
[24] Jingyi Deng, Chenhao Lin, Pengbin Hu, Chao Shen, Qian Wang, Qi Li, and Qiming Li. Towards benchmarking and evaluating deepfake detection. IEEE Transactions on Dependable and Secure Computing, 2024.
[25] Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. Deepfakebench: A comprehensive benchmark of deepfake detection. In NeurIPS, 2023.
[26] Binh M Le, Jiwon Kim, Shahroz Tariq, Kristen Moore, Alsharif Abuadbba, and Simon S Woo. Sok: Facial deepfake detectors. arXiv, 2024.
[27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[28] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
[29] Midjourney. https://mid-journey.ai/. Accessed: 2024-04-17.
[30] Aditya Ramesh et al. Hierarchical text-conditional image generation with clip latents. arXiv, 1(2):3, 2022.
[31] Donie O’Sullivan. A high school student created a fake 2020 us candidate. twitter verified it. https://cnn.it/3HpHfzz, 2020. Accessed: 2024-04-21.
[32] Shannon Bond. That smiling linkedin profile face might be a computer-generated fake. https://www.npr.org/2022/03/27/1088140809/fake-linkedin-profiles, 2022. Accessed: 2024-04-21.
[33] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2889–2898, 2020.
[34] Kartik Narayan, Harsh Agarwal, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, and Richa Singh. Deephy: On deepfake phylogeny. In 2022 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2022.
[35] Kartik Narayan, Harsh Agarwal, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, and Richa Singh. Df-platter: multi-face heterogeneous deepfake dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9739–9748, 2023.
[36] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[37] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv, 2014.
[38] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE international conference on computer vision workshops, pages 10–15, 2015.
[39] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397, 2020.
[40] Google Research. Contributing data to deepfake detection research, 2019. Accessed: 2024-04-12.
[41] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207–3216, 2020.
[42] Megvii Technology Limited. Face++ Face Detection. https://www.faceplusplus.com/face-detection/. Accessed: 2024-03.
[43] InsightFace Project Contributors. InsightFace: State-of-the-Art Face Analysis Toolbox. https://insightface.ai/. Accessed: 2024-03.
[44] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open clip. https://github.com/mlfoundations/open_clip, 2021.
[45] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2020.
[46] Philipp Terhörst, Marco Huber, Jan Niklas Kolf, Ines Zelch, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Reliable age and gender estimation from face images: Stating the confidence of model predictions. In 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8. IEEE, 2019.
[47] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018.
[48] Philipp Terhörst, Daniel Fährmann, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Maad-face: A massively annotated attribute dataset for face images. IEEE Transactions on Information Forensics and Security, 16:3942–3957, 2021.
[49] Oliver Giudice, Luca Guarnera, and Sebastiano Battiato. Fighting deepfakes by detecting gan dct anomalies. Journal of Imaging, 7(8):128, 2021.
[50] Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Reverse engineering of generative models: Inferring model hyperparameters from generated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[51] David Beniaguev. Synthetic faces high quality (sfhq) dataset, 2022.
[52] Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking human and model perception of ai-generated images. Advances in Neural Information Processing Systems, 36, 2024.
[53] L Minh Dang, Syed Ibrahim Hassan, Suhyeon Im, Jaecheol Lee, Sujin Lee, and Hyeonjoon Moon. Deep learning based computer generated face identification using convolutional neural network. Applied Sciences, 8(12):2610, 2018.
[54] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
[55] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023.
[56] Minchul Kim, Feng Liu, Anil Jain, and Xiaoming Liu. Dcface: Synthetic face generation with dual condition diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12715–12725, 2023.
[57] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[58] Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, and Shaikh Anowarul Fattah. Artifact: A large-scale dataset with artificial and factual images for generalizable and robust synthetic image detection. arXiv e-prints, pages arXiv–2302, 2023.
[59] Haixu Song, Shiyu Huang, Yinpeng Dong, and Wei-Wei Tu. Robustness and generalizability of deepfake detection: A study with diffusion models. arXiv preprint arXiv:2309.02218, 2023.
[60] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020.
[61] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
[62] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
[63] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, 2021.
[64] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision, pages 86–103. Springer, 2020.
[65] Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 772–781, 2021.
[66] Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Generalizing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021.
[67] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023.
[68] Yunsheng Ni, Depu Meng, Changqian Yu, Chengbin Quan, Dongchun Ren, and Youjian Zhao. Core: Consistent representation learning for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12–21, 2022.
[69] Xiaotian Han, Jianfeng Chi, Yu Chen, Qifan Wang, Han Zhao, Na Zou, and Xia Hu. Ffb: A fair fairness benchmark for in-processing group fairness methods. In ICLR, 2024.
[70] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6):1–35, 2021.
[71] Jialu Wang, Xin Eric Wang, and Yang Liu. Understanding instance-level impact of fairness constraints. In International Conference on Machine Learning, pages 23114–23130. PMLR, 2022.
[72] Hao Wang, Luxi He, Rui Gao, and Flavio P Calmon. Aleatoric and epistemic discrimination in classification. ICML, 2023.
[73] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
[74] Shu Hu and George H Chen. Fairness in survival analysis with distributionally robust optimization. arXiv, 2023.
[75] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. Stgan: A unified selective transfer network for arbitrary image attribute editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3673–3682, 2019.
[76] Federico Cocchi, Lorenzo Baraldi, Samuele Poppi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Unveiling the impact of image transformations on deepfake detection: An experimental analysis. In International Conference on Image Analysis and Processing, pages 345–356. Springer, 2023.
[77] Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. arXiv preprint arXiv:2312.00195, 2023.
[78] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[79] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[80] Ying Xu et al. A comprehensive analysis of ai biases in deepfake detection with massively annotated databases. arXiv, 2022.
[81] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on information forensics and security, 9(12):2170–2179, 2014.
[82] Robert Williamson and Aditya Menon. Fairness risk measures. In International conference on machine learning, pages 6786–6797. PMLR, 2019.
[83] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33:8847–8860, 2020.
[84] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
[85] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR, 2018.
[86] John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
[87] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.

Appendix

Appendix A The Details of Demographically Annotated AI-Face Dataset

A.1 Phase1: Annotator Development

A.1.1 Annotator Implementation Details

For developing the annotator, all experiments are based on PyTorch and run on a single NVIDIA RTX A6000 GPU. For training, we fix the batch size to 64 and the number of epochs to 32, and use the Adam optimizer with an initial learning rate $\beta=1e-3$ . Additionally, we employ a Cosine Annealing Learning Rate Scheduler to modulate the learning rate adaptively across the training duration. The hyperparameter $\gamma$ in SAM optimization is set to 0.05. For uncertainty estimation, $k$ and $\rho$ in the uncertainty score $V(X^{(a)})$ are set to 100 and 0.2, respectively.
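For reference, the standard closed form of a cosine annealing schedule is sketched below in plain Python, starting from the initial learning rate of 1e-3 stated above. The annealing period (taken here to be the 32 training epochs) and the minimum learning rate (taken to be 0) are assumptions; the text only names the scheduler. The function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-3, total_epochs=32, min_lr=0.0):
    """Learning rate at `epoch` under a single cosine annealing cycle.

    total_epochs and min_lr are assumed values, not taken from the paper.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)

# Learning rate at the start of each epoch, plus the value after the last one.
schedule = [cosine_annealing_lr(epoch) for epoch in range(33)]
```

Under these assumptions the learning rate starts at 1e-3, reaches half that value midway through training, and decays to zero at the end.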

A.1.2 Details of Threshold Settings for Sample Difficulty Level

For Q2, Setting: According to the distributions shown in Appendix A.2.2, for the VGGFace2 [47], A-DFDC [80], and A-DFD [80] test sets, the thresholds $t_{1}^{Gender}$ and $t_{2}^{Gender}$ are set to 0.25 and 0.4, respectively, and $t_{1}^{Age}$ and $t_{2}^{Age}$ are set to 0.3 and 0.5, respectively. The thresholds for the gender attribute are stricter than those for age because gender prediction is a relatively easier task than age prediction, as the distributions also reflect. For A-FF++ [80] and A-Celeb-DF-v2 [80], we adjust $t_{1}^{Gender}$ to 0.21 and $t_{2}^{Gender}$ to 0.25 in order to obtain a sufficient number of images (1,500) in each sample difficulty level subset, especially for the ‘Hard’ level.
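A minimal sketch of how two thresholds can split samples into three difficulty levels is given below. It rests on assumptions that this appendix does not spell out: that the thresholded quantity is the per-attribute uncertainty score, that lower uncertainty means an easier sample, and that the lower bound of each interval is inclusive. The threshold values are those stated above for VGGFace2, A-DFDC, and A-DFD; the function and constant names are hypothetical.

```python
def difficulty_level(uncertainty, t1, t2):
    """Map an uncertainty score to 'Easy', 'Medium', or 'Hard'.

    Assumed rule: score < t1 is Easy, t1 <= score < t2 is Medium,
    and score >= t2 is Hard.
    """
    if not t1 < t2:
        raise ValueError("expected t1 < t2")
    if uncertainty < t1:
        return "Easy"
    if uncertainty < t2:
        return "Medium"
    return "Hard"

# Thresholds stated for the VGGFace2, A-DFDC, and A-DFD test sets.
GENDER_THRESHOLDS = (0.25, 0.4)
AGE_THRESHOLDS = (0.3, 0.5)

gender_level = difficulty_level(0.45, *GENDER_THRESHOLDS)
age_level = difficulty_level(0.45, *AGE_THRESHOLDS)
```

With the same score of 0.45, the sample counts as ‘Hard’ for gender but only ‘Medium’ for age, which reflects the stricter gender thresholds.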

A.1.3 Additional Annotator Evaluation Results

Tables 7 to 11 compare our annotator against the baselines InsightFace [43] and Face++ [42] on detailed attributes. The findings align with the results in Table 3 of the submitted manuscript. For cross-domain evaluation, we additionally choose the Adience [81] dataset, whose images are manually annotated and which consists of over 26.5k real images of over 2.2k different individuals in unconstrained environments, to further validate the effectiveness and generalization capability of our annotator. Results in Table 12 show that our annotator again outperforms InsightFace [43] and Face++ [42]. Overall, one intra-domain dataset (VGGFace2) and five cross-domain datasets (A-FF++, A-DFDC, A-DFD, A-Celeb-DF-v2, and Adience) all validate our annotator’s superior performance over the current state-of-the-art face attribute prediction tools Face++ [42] and InsightFace [43].

Level	Method	VGGFace2 [47]
		Female			Male			Young			Middle_Aged			Senior
		precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1
All	Face++ [42]	77.042 (0.363)	83.597 (0.789)	80.060 (0.444)	79.571 (0.780)	72.314 (0.516)	75.584 (0.453)	80.201 (2.448)	34.649 (1.182)	46.991 (1.278)	46.049 (0.832)	63.692 (1.588)	53.327 (0.976)	66.266 (1.030)	77.261 (1.188)	71.320 (0.955)
All	InsightFace [43]	76.560 (0.533)	78.062 (0.737)	77.281 (0.534)	76.946 (0.629)	75.395 (0.555)	76.139 (0.474)	78.730 (2.135)	30.533 (0.961)	42.907 (1.052)	41.660 (1.005)	56.653 (1.429)	47.936 (1.078)	61.426 (0.649)	76.107 (1.212)	67.966 (0.724)
All	Ours	81.452 (0.413)	89.467 (0.331)	85.158 (0.323)	87.267 (0.393)	78.329 (0.572)	82.401 (0.438)	95.619 (0.565)	80.467 (1.343)	86.659 (0.629)	91.189 (0.812)	75.027 (0.880)	81.665 (0.585)	88.043 (0.982)	76.720 (1.388)	81.747 (0.837)
Easy	Face++ [42]	97.482 (0.604)	96.742 (0.549)	97.108 (0.329)	96.697 (0.545)	97.439 (0.642)	97.064 (0.355)	89.360 (0.670)	57.507 (1.825)	69.964 (1.403)	58.673 (0.837)	72.205 (1.397)	64.729 (0.699)	79.536 (1.119)	89.546 (0.600)	84.240 (0.687)
Easy	InsightFace [43]	97.625 (0.403)	96.373 (0.229)	96.994 (0.124)	96.420 (0.206)	97.653 (0.410)	97.032 (0.135)	88.544 (0.917)	50.080 (1.831)	63.957 (1.585)	53.146 (0.621)	65.720 (1.001)	58.762 (0.497)	73.657 (0.808)	88.200 (0.748)	80.271 (0.555)
Easy	Ours	99.575 (0.176)	99.893 (0.100)	99.734 (0.126)	99.893 (0.100)	99.573 (0.177)	99.733 (0.127)	99.720 (0.098)	99.760 (0.080)	99.740 (0.049)	99.279 (0.267)	99.000 (0.400)	99.139 (0.164)	99.551 (0.431)	96.480 (0.688)	97.988 (0.204)
Medium	Face++ [42]	72.977 (0.292)	81.336 (1.246)	76.927 (0.679)	78.435 (1.101)	69.245 (0.681)	73.549 (0.618)	76.257 (2.732)	24.427 (0.709)	36.996 (1.057)	41.197 (0.961)	61.838 (2.458)	49.446 (1.472)	62.616 (1.171)	73.719 (0.503)	67.710 (0.759)
Medium	InsightFace [43]	73.710 (0.699)	75.360 (1.316)	74.521 (0.907)	74.807 (1.070)	73.120 (0.766)	73.950 (0.753)	74.222 (2.488)	22.080 (0.588)	34.033 (0.928)	37.372 (1.208)	53.480 (2.160)	43.995 (1.535)	58.070 (0.282)	73.840 (1.216)	65.008 (0.497)
Medium	Ours	82.518 (0.621)	95.253 (0.482)	88.428 (0.439)	94.389 (0.544)	79.813 (0.858)	86.489 (0.583)	96.648 (0.627)	84.200 (1.730)	89.987 (0.111)	95.382 (0.604)	75.920 (1.017)	84.540 (0.650)	87.960 (0.915)	76.800 (1.544)	81.993 (1.018)
Hard	Face++ [42]	60.667 (0.194)	72.714 (0.571)	66.146 (0.323)	63.582 (0.694)	50.259 (0.226)	56.140 (0.385)	74.987 (3.942)	22.012 (1.012)	34.013 (1.374)	38.276 (0.697)	57.032 (0.910)	45.807 (0.758)	56.647 (0.799)	68.517 (2.462)	62.009 (1.420)
Hard	InsightFace [43]	58.345 (0.498)	62.453 (0.667)	60.329 (0.570)	59.611 (0.610)	55.413 (0.489)	57.435 (0.533)	73.425 (2.999)	19.440 (0.463)	30.730 (0.642)	34.462 (1.187)	50.760 (1.127)	41.050 (1.202)	52.550 (0.858)	66.280 (1.671)	58.618 (1.121)
Hard	Ours	62.263 (0.443)	73.255 (0.410)	67.312 (0.403)	67.518 (0.534)	55.600 (0.680)	60.982 (0.604)	90.490 (0.971)	57.440 (2.218)	70.249 (1.726)	78.905 (1.564)	50.160 (1.222)	61.315 (0.941)	76.617 (1.600)	56.880 (1.933)	65.260 (1.288)

Table 7: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on VGGFace2 dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.

Level	Method	A-FF++ [80]
		Female			Male			Young			Middle_Aged			Senior
		precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1
		90.106	88.345	89.129	88.560	90.007	89.201	74.707	72.839	73.547	52.934	81.935	63.981	84.769	37.933	51.193
	Face++ [42]	(0.282)	(0.400)	(0.220)	(0.323)	(0.295)	(0.188)	(1.264)	(0.581)	(0.656)	(0.790)	(1.288)	(0.852)	(0.922)	(1.705)	(1.884)
		87.918	81.284	84.344	82.787	88.662	85.533	67.148	83.284	74.317	48.659	74.220	58.725	93.455	20.959	33.854
	InsightFace [43]	(0.458)	(0.861)	(0.560)	(0.625)	(0.472)	(0.399)	(1.429)	(0.427)	(0.959)	(0.658)	(1.532)	(0.831)	(1.228)	(1.708)	(2.315)
		89.851	94.149	91.866	93.177	88.451	90.641	90.888	87.477	88.689	88.461	71.439	78.337	93.390	66.519	77.023
All	Ours	(0.290)	(0.302)	(0.202)	(0.280)	(0.384)	(0.226)	(0.632)	(0.280)	(0.396)	(1.556)	(1.064)	(1.012)	(0.686)	(1.461)	(1.138)
		98.157	97.947	98.052	97.950	98.159	98.054	85.686	93.398	89.372	69.170	89.520	78.036	95.646	58.880	72.888
	Face++ [42]	(0.302)	(0.136)	(0.146)	(0.131)	(0.308)	(0.151)	(0.625)	(1.072)	(0.676)	(0.878)	(0.431)	(0.651)	(0.371)	(0.546)	(0.427)
		97.004	96.640	96.820	96.656	97.013	96.833	76.172	98.120	85.762	61.246	86.840	71.828	98.106	28.800	44.524
	InsightFace [43]	(0.470)	(0.605)	(0.387)	(0.579)	(0.482)	(0.380)	(1.303)	(0.665)	(1.016)	(0.576)	(1.039)	(0.519)	(0.755)	(0.633)	(0.718)
		97.987	99.920	98.944	99.919	97.947	98.923	91.450	100.000	95.530	99.494	94.320	96.838	95.154	85.022	89.802
Easy	Ours	(0.222)	(0.065)	(0.092)	(0.067)	(0.233)	(0.097)	(0.862)	(0.000)	(0.470)	(0.170)	(0.483)	(0.286)	(0.500)	(0.724)	(0.478)
		98.839	89.590	93.985	90.605	98.960	94.597	84.212	80.092	82.098	48.990	86.710	62.604	91.594	25.400	39.746
	Face++ [42]	(0.312)	(0.691)	(0.294)	(0.553)	(0.285)	(0.230)	(0.742)	(0.379)	(0.333)	(0.591)	(0.585)	(0.520)	(0.670)	(1.688)	(2.102)
		95.655	79.813	87.017	82.685	96.373	89.005	69.090	85.800	76.542	45.982	73.680	56.622	96.654	15.040	26.022
	InsightFace [43]	(0.499)	(0.798)	(0.546)	(0.573)	(0.433)	(0.408)	(0.862)	(0.358)	(0.579)	(0.273)	(0.985)	(0.435)	(0.636)	(0.794)	(1.194)
		98.907	98.827	98.866	98.828	98.907	98.867	96.968	97.748	97.356	95.636	79.488	86.812	96.102	63.732	76.632
Medium	Ours	(0.227)	(0.131)	(0.103)	(0.128)	(0.229)	(0.104)	(0.311)	(0.230)	(0.158)	(0.900)	(0.673)	(0.415)	(0.399)	(1.483)	(1.134)
		73.323	77.498	75.352	77.123	72.901	74.952	54.224	45.026	49.172	40.642	69.576	51.302	67.066	29.518	40.946
	Face++ [42]	(0.231)	(0.374)	(0.220)	(0.284)	(0.292)	(0.184)	(2.425)	(0.291)	(0.960)	(0.901)	(2.847)	(1.385)	(1.727)	(2.881)	(3.122)
		71.096	67.400	69.194	69.020	72.600	70.762	56.182	65.932	60.648	38.748	62.140	47.724	85.604	19.036	31.016
		72.658	83.700	77.787	80.784	68.500	74.133	84.246	64.684	73.180	70.252	40.508	51.360	88.914	50.804	64.634
Hard	Ours	(0.419)	(0.710)	(0.411)	(0.645)	(0.691)	(0.477)	(0.722)	(0.610)	(0.561)	(3.600)	(2.035)	(2.334)	(1.160)	(2.175)	(1.803)

Table 8: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-FF++ dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.

Level	Method	A-DFDC [19]
		Female			Male			Young			Middle_Aged			Senior
		precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1
		74.751	83.324	78.453	78.824	68.797	72.922	74.499	81.417	77.479	58.696	74.613	65.608	85.533	52.741	63.542
	Face++ [42]	(0.675)	(0.948)	(0.603)	(0.933)	(0.929)	(0.732)	(0.567)	(1.186)	(0.772)	(0.688)	(0.420)	(0.496)	(0.792)	(1.559)	(1.469)
		72.691	64.662	68.376	68.160	75.560	71.623	71.209	86.658	77.985	58.102	76.777	66.086	83.061	36.659	49.620
	InsightFace [43]	(0.649)	(1.208)	(0.671)	(0.779)	(0.754)	(0.453)	(0.852)	(0.703)	(0.676)	(1.040)	(0.370)	(0.797)	(0.989)	(1.802)	(1.833)
		74.805	88.720	80.839	85.137	67.864	74.834	88.669	76.413	81.695	91.753	77.845	83.772	90.797	77.147	82.823
All	Ours	(0.588)	(0.587)	(0.498)	(0.622)	(0.848)	(0.619)	(0.633)	(1.103)	(0.752)	(0.736)	(0.670)	(0.545)	(0.618)	(1.356)	(1.116)
		94.124	89.917	91.965	90.351	94.367	92.309	86.586	96.596	91.316	73.340	85.560	78.974	99.440	71.324	83.066
	Face++ [42]	(0.977)	(1.010)	(0.564)	(0.835)	(1.025)	(0.535)	(0.691)	(0.713)	(0.489)	(0.853)	(0.933)	(0.622)	(0.255)	(1.013)	(0.768)
		91.874	78.693	84.753	81.389	93.013	86.801	73.580	95.760	83.216	62.690	77.960	69.496	94.034	42.760	58.776
	InsightFace [43]	(0.986)	(1.798)	(0.744)	(1.140)	(1.059)	(0.384)	(0.821)	(0.196)	(0.552)	(0.659)	(0.898)	(0.746)	(1.262)	(1.209)	(1.189)
		95.289	97.413	96.337	97.353	95.173	96.248	97.448	99.240	98.334	94.902	96.996	95.938	99.730	89.844	94.528
Easy	Ours	(1.019)	(0.136)	(0.566)	(0.157)	(1.098)	(0.621)	(0.635)	(0.233)	(0.359)	(0.030)	(0.540)	(0.280)	(0.168)	(0.433)	(0.303)
		70.438	81.415	75.529	77.764	65.533	71.124	70.980	64.526	67.584	53.830	69.930	60.828	73.342	58.332	64.972
	Face++ [42]	(0.334)	(0.881)	(0.528)	(0.836)	(0.418)	(0.457)	(0.703)	(2.147)	(1.379)	(0.578)	(0.000)	(0.372)	(0.815)	(1.562)	(1.158)
		69.126	66.733	67.905	67.859	70.200	69.006	73.794	77.866	75.768	54.900	71.330	62.038	68.248	44.000	53.474
	InsightFace [43]	(0.316)	(1.285)	(0.813)	(0.788)	(0.221)	(0.345)	(1.105)	(1.205)	(0.966)	(1.229)	(0.000)	(0.780)	(0.711)	(2.241)	(1.769)
		69.072	95.067	80.011	92.097	57.433	70.746	89.568	67.934	77.250	94.220	69.246	79.824	82.306	82.014	82.158
Medium	Ours	(0.350)	(0.680)	(0.444)	(1.027)	(0.512)	(0.593)	(0.325)	(2.048)	(1.451)	(1.257)	(0.431)	(0.524)	(1.062)	(0.908)	(0.891)
		59.692	78.639	67.866	68.357	46.489	55.334	65.930	83.128	73.536	48.918	68.350	57.022	83.816	28.568	42.588
	Face++ [42]	(0.715)	(0.953)	(0.716)	(1.128)	(1.344)	(1.203)	(0.307)	(0.698)	(0.449)	(0.635)	(0.326)	(0.495)	(1.307)	(2.102)	(2.481)
		57.073	48.560	52.471	55.231	63.467	59.062	66.252	86.348	74.972	56.716	81.042	66.724	86.900	23.216	36.610
	InsightFace [43]	(0.647)	(0.543)	(0.456)	(0.409)	(0.983)	(0.629)	(0.631)	(0.707)	(0.511)	(1.233)	(0.211)	(0.864)	(0.995)	(1.955)	(2.540)
		60.054	73.680	66.170	65.961	50.987	57.509	78.990	62.066	69.500	86.136	67.294	75.554	90.354	59.582	71.784
Hard	Ours	(0.394)	(0.945)	(0.485)	(0.683)	(0.934)	(0.645)	(0.940)	(1.029)	(0.447)	(0.922)	(1.038)	(0.832)	(0.624)	(2.726)	(2.154)

Table 9: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-DFDC dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.

Level	Method	A-DFD [19]
		Female			Male
		precision	recall	F1	precision	recall	F1
		74.375	62.743	67.925	68.258	78.854	73.096
	Face++ [42]	(1.442)	(1.445)	(1.256)	(1.197)	(1.544)	(1.228)
		71.967	51.796	59.975	63.600	81.636	71.405
	InsightFace [43]	(1.062)	(1.161)	(1.053)	(0.651)	(0.863)	(0.660)
		72.884	83.547	77.548	78.938	66.418	71.615
All	Ours	(0.580)	(0.753)	(0.538)	(0.750)	(0.979)	(0.783)
		95.014	82.284	88.179	84.398	95.677	89.676
	Face++ [42]	(0.377)	(1.876)	(1.011)	(1.378)	(0.391)	(0.686)
		94.405	75.573	83.944	79.639	95.520	86.858
	InsightFace [43]	(0.777)	(0.956)	(0.794)	(0.683)	(0.634)	(0.596)
		94.434	93.600	94.013	93.659	94.480	94.066
Easy	Ours	(0.418)	(0.566)	(0.307)	(0.517)	(0.451)	(0.295)
		65.536	60.151	62.715	63.114	68.283	65.587
	Face++ [42]	(1.779)	(1.249)	(1.202)	(1.077)	(2.356)	(1.566)
		64.397	47.147	54.434	58.323	73.947	65.211
	InsightFace [43]	(1.071)	(1.433)	(1.310)	(0.769)	(0.646)	(0.671)
		65.158	86.133	74.188	79.536	53.920	64.258
Medium	Ours	(0.886)	(0.625)	(0.666)	(0.911)	(1.776)	(1.472)
		62.576	45.793	52.882	57.261	72.603	64.024
	Face++ [42]	(2.170)	(1.210)	(1.556)	(1.137)	(1.885)	(1.433)
		57.100	32.667	41.547	52.838	75.440	62.145
	InsightFace [43]	(1.339)	(1.096)	(1.054)	(0.501)	(1.310)	(0.714)
		59.062	70.907	64.442	63.618	50.853	56.520
Hard	Ours	(0.437)	(1.068)	(0.641)	(0.820)	(0.709)	(0.582)

Table 10: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-DFD dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.

		A-Celeb-DF-v2 [19]
		Female			Male
Level	Method	precision	recall	F1	precision	recall	F1
		97.624	83.302	89.455	86.517	98.318	91.819
	Face++ [42]	(0.815)	(0.727)	(0.438)	(0.494)	(0.640)	(0.453)
		97.442	85.842	91.041	88.079	98.007	92.635
	InsightFace [43]	(0.537)	(0.569)	(0.317)	(0.388)	(0.464)	(0.299)
		96.381	93.611	94.921	94.147	96.687	95.355
All	Ours	(0.756)	(0.564)	(0.400)	(0.422)	(0.782)	(0.426)
		99.889	96.480	98.155	96.598	99.893	98.218
	Face++ [42]	(0.055)	(0.418)	(0.236)	(0.392)	(0.053)	(0.222)
		99.837	98.107	98.964	98.140	99.840	98.983
	InsightFace [43]	(0.054)	(0.352)	(0.180)	(0.340)	(0.053)	(0.174)
		100.000	99.973	99.987	99.973	100.000	99.987
Easy	Ours	(0.000)	(0.053)	(0.027)	(0.053)	(0.000)	(0.027)
		99.732	89.227	94.185	90.260	99.760	94.771
	Face++ [42]	(0.199)	(0.952)	(0.558)	(0.787)	(0.177)	(0.460)
		99.639	88.320	93.638	89.513	99.680	94.323
	InsightFace [43]	(0.226)	(0.496)	(0.354)	(0.412)	(0.200)	(0.298)
		99.760	99.760	99.760	99.760	99.760	99.760
Medium	Ours	(0.053)	(0.177)	(0.100)	(0.176)	(0.053)	(0.100)
		93.251	64.200	76.024	72.694	95.300	82.469
	Face++ [42]	(2.190)	(0.812)	(0.519)	(0.304)	(1.691)	(0.679)
		92.850	71.100	80.521	76.584	94.500	84.600
	InsightFace [43]	(1.331)	(0.860)	(0.417)	(0.412)	(1.140)	(0.424)
		89.384	81.100	85.015	82.707	90.300	86.319
Hard	Ours	(2.214)	(1.463)	(1.073)	(1.037)	(2.294)	(1.150)

Table 11: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on A-Celeb-DF-v2 dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.

Level	Method	Adience [81]
		Female			Male			Young			Middle_Aged			Senior
		precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1	precision	recall	F1
		71.800	87.124	76.457	84.701	71.257	76.322	89.823	68.558	77.632	51.442	70.678	58.668	50.280	76.456	60.540
	Face++ [42]	(0.900)	(0.743)	(0.797)	(0.732)	(0.863)	(0.601)	(0.444)	(1.024)	(0.696)	(1.569)	(1.397)	(1.313)	(3.051)	(3.228)	(2.199)
		67.003	74.180	68.487	73.848	67.877	69.512	86.510	37.559	51.425	37.028	65.519	45.545	26.775	72.412	38.946
	InsightFace [43]	(0.687)	(1.140)	(0.650)	(1.068)	(0.722)	(0.583)	(1.147)	(1.079)	(1.225)	(0.888)	(1.304)	(0.884)	(1.672)	(2.321)	(2.045)
		78.103	94.416	83.852	96.915	82.496	88.602	77.517	81.767	79.578	48.206	37.327	41.843	67.352	69.370	68.241
All	Ours	(0.732)	(0.338)	(0.603)	(0.382)	(0.607)	(0.443)	(0.500)	(0.963)	(0.579)	(1.794)	(0.529)	(1.061)	(3.613)	(3.679)	(3.258)
		97.977	93.613	95.658	92.325	97.319	94.754	96.754	81.939	88.730	33.469	68.722	44.960	59.861	88.473	71.165
	Face++ [42]	(0.251)	(0.646)	(0.273)	(0.660)	(0.344)	(0.284)	(0.253)	(0.746)	(0.434)	(2.274)	(1.985)	(2.095)	(4.706)	(4.331)	(2.485)
		96.613	89.815	93.087	88.180	96.009	91.925	97.076	58.063	72.662	19.981	70.808	31.150	26.572	84.972	40.411
	InsightFace [43]	(0.210)	(1.000)	(0.537)	(0.980)	(0.331)	(0.540)	(0.455)	(0.819)	(0.717)	(0.942)	(2.276)	(1.185)	(2.943)	(2.722)	(3.521)
		99.822	99.974	99.898	99.969	99.776	99.872	93.651	97.159	95.372	57.920	37.414	45.375	97.721	96.831	97.221
Easy	Ours	(0.104)	(0.052)	(0.078)	(0.063)	(0.124)	(0.094)	(0.599)	(0.507)	(0.455)	(2.428)	(0.217)	(1.469)	(3.063)	(2.986)	(2.037)
		82.280	87.626	84.863	72.271	63.091	67.350	90.557	64.622	75.413	57.229	71.100	63.397	47.362	75.972	58.295
	Face++ [42]	(1.064)	(0.710)	(0.617)	(1.249)	(1.654)	(0.996)	(0.379)	(1.444)	(1.015)	(1.735)	(1.580)	(1.350)	(2.070)	(3.031)	(1.800)
		78.900	75.366	77.087	55.589	60.493	57.923	83.417	30.372	44.510	40.618	61.066	48.776	26.303	65.071	37.453
	InsightFace [43]	(1.250)	(0.827)	(0.818)	(0.852)	(1.375)	(0.637)	(2.029)	(2.139)	(2.587)	(0.934)	(0.713)	(0.618)	(1.300)	(2.274)	(1.616)
		92.671	99.184	95.816	98.159	84.639	90.897	76.159	80.899	78.455	40.740	34.599	37.409	69.317	69.799	69.532
Medium	Ours	(0.607)	(0.280)	(0.385)	(0.599)	(0.735)	(0.505)	(0.372)	(0.913)	(0.492)	(1.307)	(0.808)	(0.812)	(5.848)	(5.737)	(5.645)
		35.144	80.134	48.851	89.507	53.361	66.860	82.159	59.114	68.752	63.628	72.212	67.646	43.617	64.923	52.162
	Face++ [42]	(1.384)	(0.874)	(1.502)	(0.287)	(0.591)	(0.522)	(0.701)	(0.882)	(0.638)	(0.697)	(0.626)	(0.495)	(2.376)	(2.322)	(2.311)
		25.495	57.358	35.287	77.776	47.129	58.688	79.037	24.242	37.102	50.485	64.681	56.708	27.452	67.194	38.973
	InsightFace [43]	(0.602)	(1.594)	(0.596)	(1.373)	(0.460)	(0.572)	(0.956)	(0.280)	(0.372)	(0.789)	(0.923)	(0.848)	(0.773)	(1.967)	(0.999)
		41.818	84.089	55.842	92.619	63.072	75.038	62.742	67.243	64.907	45.958	39.968	42.745	35.018	41.479	37.972
Hard	Ours	(1.484)	(0.684)	(1.345)	(0.486)	(0.962)	(0.730)	(0.530)	(1.468)	(0.790)	(1.648)	(0.563)	(0.902)	(1.929)	(2.314)	(2.092)

Table 12: Detailed comparison of our annotator against Face++ [42] and InsightFace [43] on Adience dataset. ‘All’ denotes the averaged metrics across three levels of sample difficulty: ‘Easy,’ ‘Medium,’ and ‘Hard.’ Prediction mean and standard deviation (in parentheses) of each method across 5 random samplings and testings within each level are reported. The best results are shown in Bold.

A.2 Phase 2: Demographic Annotation Generation

A.2.1 Detailed Information of Datasets

Methods	#Samples	FFHQ	CASIA-WebFace	IMDB-WIKI	CelebA	A-FF++	A-DFDC	A-DFD	A-Celeb-DF-v2
		[6]	[37]	[38]	[36]	(Real) [80]	(Real) [80]	(Real) [80]	(Real) [80]
A-FF++ [2]	105K					✓
A-DFDC [39]	37K						✓
A-DFD [40]	31K							✓
A-Celeb-DF-v2 [41]	155K								✓
AttGAN [49]	6K				✓
MMDGAN [50]	1K				✓
StarGAN [49]	5.6K				✓
StyleGAN [49]	10K	✓
StyleGAN2 [51]	118K	✓
StyleGAN3 [52]	26.7K	✓
MSG-StyleGAN [50]	1K				✓
ProGAN [53]	100K				✓
STGAN [50]	1K				✓
VQGAN [54]	50K	✓
DALLE2 [55]	204				✓
IF [55]	505				✓
Midjourney [55]	100				✓
DCFace [56]	529K		✓
Latent Diffusion [57]	20K	✓
Palette [58]	6K				✓
SD v1.5 [59]	18K			✓
SD Inpainting [59]	20.9K			✓
Total	1,245,660	70,000	474,876	26,788	202,502	21,593	37,836	8,856	23,645
Total	1,245,660	866,096

Table 13: Number of real and fake images from different fake image datasets and their corresponding real image sources.

Table 13 details all subsets that we collected and incorporated into our AI-Face dataset. It covers fake face images taken from deepfake videos as well as images generated by GANs and DMs. The real sources of most AI-generated face subsets are FFHQ [6] and CelebA [36]. In total, our AI-Face dataset contains 30 subsets (22 fake subsets and 8 real subsets) and 37 generation methods (5 in A-FF++, 5 in A-DFD, 8 in A-DFDC, 1 in A-Celeb-DF-v2, 10 GANs, and 8 DMs), comprising 1,245,660 fake face images and 866,096 real face images. Fig. A.1 visualizes face images from each subset, and Fig. A.2 shows the detailed demographic distribution of our AI-Face dataset. The dataset is relatively gender-balanced, and the majority of subjects are young and white.

A.2.2 Details of Threshold Settings for Human Correction

In this section, we present the uncertainty score distributions of each attribute (i.e., Gender, Age, and Race) for each subset in our AI-Face dataset, as shown in Fig. A.4 to Fig. A.31. Overall, these distributions show that our annotator is more confident when predicting gender than when predicting age. Because different subsets show different distributions, we dynamically adjust the threshold $t^{{a}_{j}}$ for each attribute $a$ on subset $j$, as defined in ‘Human Correction’ in Section 3.2. First, we fit a gamma distribution to the uncertainty scores and calculate its mean and standard deviation. Then, $t^{{a}_{j}}$ is calculated as $\text{Mean}+\lambda\,\text{Std}$. Given the threshold, we obtain the number of images in each subset that need human correction. Assuming that it takes three seconds for a human to correct one annotation for one image, we can then calculate the total time needed to correct the images beyond the threshold. Therefore, $\lambda$ is dynamically adjusted based on the distribution and the total time needed for human correction.
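The thresholding procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not our released code: the gamma fit is done by the method of moments (so the fitted mean and standard deviation coincide with the sample moments), and the function names, the candidate $\lambda$ grid, and the time budget are hypothetical.

```python
import statistics

SECONDS_PER_CORRECTION = 3  # per-annotation cost assumed in the text


def gamma_moment_fit(scores):
    """Method-of-moments gamma fit (an assumption; the fitting routine is not specified).

    Returns (shape, scale, mean, std). Under this fit, the gamma mean and std
    equal the sample mean and the population standard deviation.
    """
    mean = statistics.fmean(scores)
    var = statistics.pvariance(scores, mu=mean)
    shape = mean * mean / var
    scale = var / mean
    return shape, scale, mean, var ** 0.5


def threshold(scores, lam):
    """Threshold t = Mean + lambda * Std for one attribute on one subset."""
    _, _, mean, std = gamma_moment_fit(scores)
    return mean + lam * std


def correction_cost(scores, lam):
    """Number of images above the threshold and the seconds needed to correct them."""
    t = threshold(scores, lam)
    n_flagged = sum(s > t for s in scores)
    return n_flagged, n_flagged * SECONDS_PER_CORRECTION


def pick_lambda(scores, budget_seconds, candidates=(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)):
    """Smallest candidate lambda whose correction time fits the (hypothetical) budget."""
    for lam in sorted(candidates):
        _, seconds = correction_cost(scores, lam)
        if seconds <= budget_seconds:
            return lam
    return max(candidates)  # fall back to the strictest threshold
```

A smaller $\lambda$ flags more images for correction, so the sketch returns the smallest value that still fits the available annotation time.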

A.2.3 Examples of Mislabeled images in A-FF++ and A-DFDC

In the evaluation results for Q1 in Section 3.2, we showed that existing annotations cannot be directly incorporated into our AI-Face dataset. Fig. A.3 displays example images where the annotations in A-FF++ [80] and A-DFDC [80] are inconsistent with those given by our annotator. A-FF++ and A-DFDC mislabel ambiguous facial images, whereas our annotator predicts them accurately. This visualization further validates that existing annotations cannot be directly merged into our dataset.

A.2.4 Additional Results of Validating the Effectiveness of Human Correction Strategy

Since A-DFD [19] and A-Celeb-DF-v2 [19] provide gender annotations, we can compare two versions of our dataset against them: one before human correction (i.e., Ours-DFD(w/o Correction) and Ours-A-Celeb-DF-v2(w/o Correction)) and one after human correction (i.e., Ours-DFD (Correction) and Ours-A-Celeb-DF-v2 (Correction)). Following the same setting as in the evaluation for Q2 in Section 3.2, we sample 1,200 attribute-balanced images (400 easy, 400 medium, and 400 hard) based on uncertainty score. Three humans re-annotated these images to establish the ground truth. As shown in Table 14, Ours-DFD (Correction) and Ours-A-Celeb-DF-v2 (Correction) outperform both our uncorrected versions and the original A-DFD and A-Celeb-DF-v2 annotations (e.g., the accuracy of Ours-DFD (Correction) is 22.866 percentage points higher than that of A-DFD and 13.526 percentage points higher than that of Ours-DFD(w/o Correction)). This suggests that the annotation quality of our dataset is much better than that of the existing annotations in A-DFD [19] and A-Celeb-DF-v2 [19], and that our human correction strategy further improves it.

Gender
	ACC	Precision	Recall	F1
A-DFD [19]	70.612	71.347	74.245	69.900
Ours-DFD(w/o Correction)	79.952	79.868	83.979	79.308
Ours-DFD (Correction)	93.478	91.673	95.034	92.898
A-Celeb-DF-v2 [19]	89.697	90.622	90.622	89.697
Ours-A-Celeb-DF-v2(w/o Correction)	91.414	91.404	91.831	91.391
Ours-A-Celeb-DF-v2 (Correction)	93.535	93.655	94.087	93.525

Table 14: Evaluation results to demonstrate the effectiveness of human correction strategy.

Appendix B Fairness Benchmark

B.1 Details of Detection Methods

Model Type	Detector	Backbone	GitHub Link	Venue
Naive	Xception [61]	Xception	https://github.com/ondyari/FaceForensics/blob/master	ICCV-2019
	EfficientB4 [62]	EfficientNet	https://github.com/lukemelas/EfficientNet-PyTorch	ICML-2019
	ViT-B/16 [63]	Transformer	https://github.com/lucidrains/vit-pytorch	ICLR-2021
Spatial	UCF [16]	Xception	https://github.com/SCLBD/DeepfakeBench/tree/main	ICCV-2023
	UnivFD [67]	CLIP ViT	https://github.com/Yuheng-Li/UniversalFakeDetect	CVPR-2023
	CORE [68]	Xception	https://github.com/niyunsheng/CORE	CVPRW-2022
Frequency	F3Net [64]	Xception	https://github.com/yyk-wew/F3Net	ECCV-2020
	SRM [66]	Xception	https://github.com/SCLBD/DeepfakeBench/tree/main	CVPR-2021
	SPSL [65]	Xception	https://github.com/SCLBD/DeepfakeBench/tree/main	CVPR-2021
Fairness-enhanced	DAW-FDD [20]	Xception	Unpublished code, reproduced by us	WACV-2024
	DAG-FDD [20]	Xception	Unpublished code, reproduced by us	WACV-2024
	PG-FDD [21]	Xception	https://github.com/Purdue-M2/Fairness-Generalization	CVPR-2024

Table 15: Summary of the implemented detectors in our fairness benchmark.

Xception [61]: is a deep convolutional neural network (CNN) architecture that relies on depthwise separable convolutions. This approach significantly reduces the number of parameters and computational cost while maintaining high performance. Xception serves as a classic backbone in deepfake detectors.

EfficientB4 [62]: is part of the EfficientNet family [62], which utilizes a novel model scaling method that uniformly scales all dimensions of depth, width, and resolution using a compound coefficient. EfficientNet also serves as a classic backbone in deepfake detectors.

ViT-B/16 [63]: is a model that applies the transformer architecture to images, where ‘B’ denotes the base model size and ‘16’ indicates the patch size. ViT-B/16 splits an image into non-overlapping patches of 16×16 pixels, linearly embeds each patch, adds positional embeddings, and feeds the resulting sequence of vectors into a standard transformer encoder.
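To make the patching step concrete, the following minimal sketch splits an image grid into non-overlapping square patches and counts the resulting sequence length. It assumes that the ‘16’ in ViT-B/16 refers to a patch side of 16 pixels; the helper names and the 224×224 input resolution used in the usage note are illustrative assumptions, not settings stated in this paper.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, each flattened into a list of pixels.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles


def num_patches(height, width, patch=16):
    """Length of the patch sequence fed to the transformer encoder."""
    return (height // patch) * (width // patch)
```

For example, under the assumed 224×224 input, `num_patches(224, 224)` gives 14 × 14 = 196 patches, each of which is then linearly embedded.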

F3Net [64]: utilizes a cross-attention two-stream network to effectively identify frequency-aware clues by integrating two branches: FAD and LFS. The FAD (Frequency-aware Decomposition) module divides the input image into various frequency bands using learnable partitions, representing the image with frequency-aware components to detect forgery patterns through this decomposition. Meanwhile, the LFS (Localized Frequency Statistics) module captures local frequency statistics to highlight statistical differences between authentic and counterfeit faces.

SPSL [65]: integrates spatial image data with the phase spectrum to detect up-sampling artifacts in face forgeries, enhancing the model’s generalization ability for face forgery detection. The paper provides a theoretical analysis of the effectiveness of using the phase spectrum. Additionally, it highlights that local texture information is more important than high-level semantic information for accurately detecting face forgeries.

SRM [66]: extracts high-frequency noise features and combines two different representations from the RGB and frequency domains to enhance the model’s generalization ability for face forgery detection.

UCF [16]: presents a multi-task disentanglement framework designed to tackle two key challenges in deepfake detection: overfitting to irrelevant features and overfitting to method-specific textures. By identifying and leveraging common features, this framework aims to improve the model’s generalization ability.

UnivFD [67]: uses the frozen CLIP ViT-L/14 [44] as a feature extractor and trains only the last linear layer to classify fake and real images.

CORE [68]: explicitly enforces the consistency of different representations. It first captures various representations through different augmentations and then regularizes the cosine distance between these representations to enhance their consistency.
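As a rough illustration of the consistency idea, the sketch below computes the cosine distance between the representations of two augmented views and averages it over a batch. It is an assumption-laden toy version: the function names are hypothetical, and the exact form of the CORE loss (e.g., any squaring or weighting) is not reproduced here.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def consistency_penalty(views_a, views_b):
    """Mean cosine distance between paired representations of two augmented views."""
    pairs = list(zip(views_a, views_b))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

The penalty is zero when both views of every image map to the same direction in feature space and grows as the representations diverge.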

DAW-FDD [20]: a demographic-aware Fair Deepfake Detection (DAW-FDD) method that leverages demographic information and employs an existing fairness risk measure [82]. At a high level, DAW-FDD aims to ensure that the losses achieved by different user-specified groups of interest (e.g., different races or genders) are similar to each other (so that the AI face detector is not more accurate on one group than on another) and, moreover, that the losses across all groups are low. Specifically, DAW-FDD uses a CVaR [83, 84] loss function across groups (to address imbalance in demographic groups) and, per group, DAW-FDD uses another CVaR loss function (to address imbalance in real vs. AI-generated training examples).
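The nested use of CVaR can be sketched as below. This is an illustrative sketch only: it uses the standard empirical form $\text{CVaR}_{\alpha}=\min_{\lambda}\lambda+\frac{1}{\alpha n}\sum_{i}[\ell_{i}-\lambda]_{+}$ with $0<\alpha\leq 1$, and the function names and the two $\alpha$ parameters are hypothetical rather than taken from the DAW-FDD implementation.

```python
def cvar(losses, alpha):
    """Empirical CVaR at level alpha (0 < alpha <= 1).

    The objective lam + sum(max(l - lam, 0)) / (alpha * n) is convex and
    piecewise linear in lam, so its minimum is attained at one of the losses.
    """
    n = len(losses)
    return min(
        lam + sum(max(l - lam, 0.0) for l in losses) / (alpha * n)
        for lam in losses
    )


def nested_cvar_loss(losses_by_group, alpha_inner, alpha_outer):
    """Inner CVaR over the per-sample losses of each group, outer CVaR across groups."""
    group_risks = [cvar(group_losses, alpha_inner)
                   for group_losses in losses_by_group.values()]
    return cvar(group_risks, alpha_outer)
```

When $\alpha n$ is an integer, the empirical CVaR equals the mean of the $\alpha n$ largest losses, so both levels focus training on the hardest samples and the worst-off groups; with $\alpha=1$ it reduces to the ordinary mean.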

DAG-FDD [20]: a demographic-agnostic Fair Deepfake Detection (DAG-FDD) method based on distributionally robust optimization (DRO) [85, 86]. To use DAG-FDD, the user does not have to specify which attributes (e.g., race and gender) to treat as sensitive; they only need to specify a probability threshold for a minority group, without explicitly identifying all possible groups.

PG-FDD [21]: PG-FDD (Preserving Generalization Fair Deepfake Detection) employs disentanglement learning to extract demographic and domain-agnostic forgery features, promoting fair learning across a flattened loss landscape. Its framework combines disentanglement learning, fairness learning, and optimization modules. The disentanglement module introduces a loss to expose demographic and domain-agnostic features that enhance fairness generalization. The fairness learning module combines these features to promote fair learning, guided by generalization principles. The optimization module flattens the loss landscape, helping the model escape suboptimal solutions and strengthen fairness generalization.

B.2 Implementation Details

For the fairness benchmark, all experiments are implemented in PyTorch and run on a single NVIDIA RTX A6000 GPU. During training, we use the SGD optimizer with a learning rate of 0.0005, a momentum of 0.9, and a weight decay of 0.005. The batch size is set to 128 for most detectors. However, for SRM [66], UCF [16], and PG-FDD [21], the batch size is reduced to 32 due to GPU memory limits. For hyperparameters defined in these detectors, we use the default values set in their original papers. All detectors are initialized with their official pre-trained weights and trained for 5 epochs.

B.3 Fairness Metrics

We assume a test set comprising indices $\{1,\ldots,n\}$. $Y_{j}$ and $\hat{Y}_{j}$ respectively represent the true and predicted labels of the sample $X_{j}$. Their values are binary, where 0 means real and 1 means fake. For all fairness metrics, a lower value means better performance.

	$\displaystyle F_{EO}:=\sum_{\mathcal{J}_{j}\in\mathcal{J}}\sum_{q=0}^{1}\left\|\frac{\sum_{j=1}^{n}\mathbb{I}_{\left[\hat{Y}_{j}=1,D_{j}=\mathcal{J}_{j},Y_{j}=q\right]}}{\sum_{j=1}^{n}\mathbb{I}_{\left[D_{j}=\mathcal{J}_{j},Y_{j}=q\right]}}-\frac{\sum_{j=1}^{n}\mathbb{I}_{\left[\hat{Y}_{j}=1,Y_{j}=q\right]}}{\sum_{j=1}^{n}\mathbb{I}_{\left[Y_{j}=q\right]}}\right\|,$
	$\displaystyle F_{OAE}:=\max_{\mathcal{J}_{j}\in\mathcal{J}}\left\{\frac{\sum_{j=1}^{n}\mathbb{I}_{[\hat{Y}_{j}=Y_{j},D_{j}=\mathcal{J}_{j}]}}{\sum_{j=1}^{n}\mathbb{I}_{[D_{j}=\mathcal{J}_{j}]}}-\min_{{\mathcal{J}_{j}}^{\prime}\in\mathcal{J}}\frac{\sum_{j=1}^{n}\mathbb{I}_{[\hat{Y}_{j}=Y_{j},D_{j}={\mathcal{J}_{j}}^{\prime}]}}{\sum_{j=1}^{n}\mathbb{I}_{[D_{j}={\mathcal{J}_{j}}^{\prime}]}}\right\},$
	$\displaystyle F_{DP}:=\max_{q\in\{0,1\}}\left\{\max_{\mathcal{J}_{j}\in\mathcal{J}}\frac{\sum_{j=1}^{n}\mathbb{I}_{[\hat{Y}_{j}=q,D_{j}=\mathcal{J}_{j}]}}{\sum_{j=1}^{n}\mathbb{I}_{[D_{j}=\mathcal{J}_{j}]}}-\min_{{\mathcal{J}_{j}}^{\prime}\in\mathcal{J}}\frac{\sum_{j=1}^{n}\mathbb{I}_{[\hat{Y}_{j}=q,D_{j}={\mathcal{J}_{j}}^{\prime}]}}{\sum_{j=1}^{n}\mathbb{I}_{[D_{j}={\mathcal{J}_{j}}^{\prime}]}}\right\},$
	$\displaystyle F_{MEO}:=\max_{q,q^{\prime}\in\{0,1\}}\left\{\max_{\mathcal{J}_{j}\in\mathcal{J}}\frac{\sum_{j=1}^{n}\mathbb{I}_{[\hat{Y}_{j}=q,Y_{j}=q^{\prime},D_{j}=\mathcal{J}_{j}]}}{\sum_{j=1}^{n}\mathbb{I}_{[D_{j}=\mathcal{J}_{j},Y_{j}=q^{\prime}]}}-\min_{{\mathcal{J}_{j}}^{\prime}\in\mathcal{J}}\frac{\sum_{j=1}^{n}\mathbb{I}_{[\hat{Y}_{j}=q,Y_{j}=q^{\prime},D_{j}={\mathcal{J}_{j}}^{\prime}]}}{\sum_{j=1}^{n}\mathbb{I}_{[D_{j}={\mathcal{J}_{j}}^{\prime},Y_{j}=q^{\prime}]}}\right\},$
	$\displaystyle F_{IND}:=\sum_{j=1}^{n}\sum_{l=j+1}^{n}\mathbb{I}_{[\left\|f(X_{j})-f(X_{l})\right\|-\delta\left\|X_{j}-X_{l}\right\|]},$
	$\displaystyle\text{Avg-}F_{R}:=\frac{1}{|F|}\sum_{f\in F}R_{m,f},\quad\text{Avg-}F_{MR}:=\frac{1}{|M_{t}|}\sum_{m\in M_{t}}\text{Avg-}F_{R}.$

Here, $D$ is the demographic variable, and $\mathcal{J}$ is the set of subgroups, with each subgroup $\mathcal{J}_{j}\in\mathcal{J}$. $M$ is the set of detection models and $F$ is the set of fairness metrics. $R_{m,f}$ is the rank of detection model $m\in M$ on fairness metric $f\in F$, and $|F|$ is the total number of fairness metrics. $T$ is the set of model types, $M_{t}$ is the set of detection models within model type $t\in T$, and $|M_{t}|$ is the total number of detection models within model type $t$. $F_{EO}$ measures the disparity in TPR and FPR between each subgroup and the overall population. $F_{OAE}$ measures the maximum ACC gap across all demographic groups. $F_{DP}$ measures the maximum difference in prediction rates across all demographic groups. $F_{MEO}$ captures the largest disparity in prediction outcomes (either positive or negative) when comparing different demographic groups. $\delta$ in $F_{IND}$ is a predefined scale factor (0.06 in our experiments), and $f(X_{j})$ represents the predicted logits of the model for input sample $X_{j}$. $F_{IND}$ reflects the principle that a model is fair across individuals if similar individuals receive similar predicted outcomes. $\text{Avg-}F_{R}$ is the average fairness rank of detection model $m$, and $\text{Avg-}F_{MR}$ is the average fairness rank of a model type.
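The four group fairness metrics can be computed from labels, predictions, and group memberships as in the following minimal sketch. It is not our benchmark code: the function names are hypothetical, values are returned as fractions rather than percentages, empty groups are assigned a rate of zero by assumption, and $F_{MEO}$ is computed from the conditional rate of predicting $q$ given true label $q^{\prime}$ within each group.

```python
def _rate(indices, condition):
    """Fraction of the given indices that satisfy the condition (0 if empty)."""
    return sum(condition(i) for i in indices) / len(indices) if indices else 0.0


def f_oae(y_true, y_pred, groups):
    """Maximum accuracy gap across demographic groups."""
    accs = [_rate([i for i, d in enumerate(groups) if d == g],
                  lambda i: y_pred[i] == y_true[i]) for g in set(groups)]
    return max(accs) - min(accs)


def f_dp(y_pred, groups):
    """Maximum gap in the rate of predicting q across groups, over q in {0, 1}."""
    gaps = []
    for q in (0, 1):
        rates = [_rate([i for i, d in enumerate(groups) if d == g],
                       lambda i: y_pred[i] == q) for g in set(groups)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)


def f_eo(y_true, y_pred, groups):
    """Sum over groups and true labels of |group rate - population rate| of predicting fake."""
    total = 0.0
    for q in (0, 1):
        population = [i for i, y in enumerate(y_true) if y == q]
        pop_rate = _rate(population, lambda i: y_pred[i] == 1)
        for g in set(groups):
            members = [i for i in population if groups[i] == g]
            total += abs(_rate(members, lambda i: y_pred[i] == 1) - pop_rate)
    return total


def f_meo(y_true, y_pred, groups):
    """Largest gap across groups in P(pred = q | true = q', group), over all q, q'."""
    worst = 0.0
    for q in (0, 1):
        for q_true in (0, 1):
            rates = [_rate([i for i, d in enumerate(groups)
                            if d == g and y_true[i] == q_true],
                           lambda i: y_pred[i] == q) for g in set(groups)]
            worst = max(worst, max(rates) - min(rates))
    return worst
```

For intersectional evaluation, the same functions can be applied with `groups` set to tuples such as (gender, race), since only equality between group labels is used.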

B.4 Full Subsets Evaluation Results

Tables 16 to 35 present the detailed test results for each subset. The findings align with the results reported in Fig. 4.

Model Type			Naive			Frequency			Spatial			Fairness-enhanced
Dataset	Attribute	Metric	Xception [61]	EfficientB4 [62]	ViT-B/16 [63]	F3Net [64]	SPSL [65]	SRM [66]	UCF [16]	UnivFD [67]	CORE [68]	DAW-FDD [20]	DAG-FDD [20]	PG-FDD [21]
FF++	Gender	$F_{MEO}$	4.353	3.346	1.161	2.595	0.887	2.492	4.916	10.873	2.516	12.198	1.606	2.214
		$F_{DP}$	1.250	1.096	0.276	0.601	0.392	0.409	1.231	2.874	0.61	2.722	1.024	0.772
		$F_{OAE}$	0.177	0.132	0.396	0.426	0.231	0.228	0.489	0.015	0.196	0.977	0.941	0.095
		$F_{EO}$	4.749	4.293	1.335	2.728	1.012	2.839	5.117	12.323	3.221	12.993	2.969	2.362
	Race	$F_{MEO}$	10.304	9.813	7.630	15.051	6.844	22.26	9.791	23.588	4.564	22.598	2.607	15.657
		$F_{DP}$	3.562	9.544	3.485	3.22	5.864	3.516	5.554	12.934	8.75	8.954	6.65	2.943
		$F_{OAE}$	4.465	7.396	6.232	4.045	2.541	5.227	4.388	6.889	3.522	7.382	1.939	3.764
		$F_{EO}$	17.066	27.404	12.835	20.586	11.277	36.221	17.288	70.499	11.944	48.644	6.882	18.386
	Age	$F_{MEO}$	9.851	5.348	5.984	6.204	3.622	14.005	9.692	15.205	9.423	24.413	1.857	6.136
		$F_{DP}$	2.887	4.708	1.280	5.661	6.479	7.919	6.196	9.205	7.693	4.346	6.221	4.339
		$F_{OAE}$	1.038	5.813	6.417	2.049	0.856	1.581	2.606	8.927	1.138	4.263	1.472	2.112
		$F_{EO}$	18.191	11.876	8.199	13.665	5.636	20.696	17.291	29.607	14.781	47.613	6.419	12.446
	Intersection	$F_{MEO}$	28.949	16.662	11.994	18.672	8.505	30.828	19.132	54.201	8.784	39.858	5.130	16.994
		$F_{DP}$	11.648	12.215	5.127	6.721	10.157	4.449	11.268	32.584	14.697	20.864	10.087	4.831
		$F_{OAE}$	8.442	10.876	10.295	7.210	4.868	8.742	5.638	15.415	8.209	10.843	3.322	4.491
		$F_{EO}$	70.162	68.005	32.625	48.53	25.922	78.296	40.971	169.535	33.428	131.755	19.399	38.887
		ACC	92.280	89.282	86.051	94.832	93.676	92.587	94.982	83.652	95.183	91.420	93.254	96.237
		AUC	95.605	91.281	83.542	97.878	97.820	96.164	98.115	76.839	98.147	94.618	97.996	98.245
			99.207	98.381	96.712	99.631	99.619	99.29	99.668	95.321	99.684	99.011	99.658	99.681
		EER	10.951	16.807	24.299	6.756	6.565	9.888	6.02	30.755	5.993	12.449	6.429	7.273

Table 16: Detailed fairness and utility evaluation results on FF++.

Model Type			Naive			Frequency			Spatial			Fairness-enhanced
Dataset	Attribute	Metric	Xception [61]	EfficientB4 [62]	ViT-B/16 [63]	F3Net [64]	SPSL [65]	SRM [66]	UCF [16]	UnivFD [67]	CORE [68]	DAW-FDD [20]	DAG-FDD [20]	PG-FDD [21]
DFDC	Gender	$F_{MEO}$	6.011	3.708	5.567	4.415	2.357	8.711	1.492	7.444	2.062	4.87	2.944	3.687
		$F_{DP}$	5.878	0.998	3.43	4.959	3.829	8.35	3.776	5.039	3.662	5.271	4.468	6.348
		$F_{OAE}$	2.97	2.744	1.88	2.645	2.537	2.438	1.427	0.075	2.178	3.012	2.233	0.869
		$F_{EO}$	6.222	5.564	7.742	5.609	3.833	10.841	2.517	12.51	3.784	4.898	3.95	4.968
	Race	$F_{MEO}$	8.525	6.846	24.319	7.667	9.726	9.139	11.603	22.342	10.74	15.529	11.403	5.992
		$F_{DP}$	21.619	11.534	20.596	25.03	24.594	21.463	25.9	24.317	26.634	23.997	25.534	24.613
		$F_{OAE}$	1.622	3.701	12.756	3.048	2.816	5.793	4.722	11.051	2.699	14.659	3.46	2.523
		$F_{EO}$	26.728	15.611	47.784	25.2	25.744	24.09	22.014	65.788	23.784	47.679	26.64	12.268
	Age	$F_{MEO}$	6.193	7.721	17.868	5.022	7.375	10.382	4.608	13.078	5.683	20.119	5.578	3.96
		$F_{DP}$	11.068	4.752	14.277	9.967	12.117	8.172	11.987	9.764	9.112	13.229	10.702	11.48
		$F_{OAE}$	2.817	5.951	4.984	3.918	2.585	6.092	2.513	7.523	3.869	12.581	2.498	1.653
		$F_{EO}$	14.397	16.233	26.396	14.327	13.88	16.625	12.03	22.816	11.274	31.018	8.736	6.954
	Intersection	$F_{MEO}$	14.479	15.029	33.979	14.067	24.924	14.117	16.119	38.533	17.421	20.447	18.268	10.973
		$F_{DP}$	28.877	17.816	30.153	32.117	31.493	27.666	31.604	28.815	33.791	27.812	30.224	31.389
		$F_{OAE}$	5.619	8.088	20.771	4.456	7.453	9.306	5.922	14.994	4.423	18.877	5.642	3.832
		$F_{EO}$	72.695	60.893	111.03	59.238	67.19	60.749	63.262	133.283	58.174	90.761	64.155	33.495
		ACC	81.223	71.939	71.044	87.658	87.482	83.536	89.155	64.164	88.75	81.452	88.867	92.905
		AUC	90.395	80.17	81.942	95.158	95.789	91.837	96.025	72.228	95.65	91.695	95.916	97.014
			91.284	81.442	82.547	95.764	96.313	92.435	96.567	75.304	96.219	92.37	96.447	97.081
		EER	18.443	28.133	26.271	12.367	10.805	15.588	10.818	33.542	10.927	17.043	10.709	8.317

Table 17: Detailed fairness and utility evaluation results on DFDC.

Model types: Naive (Xception, EfficientB4, ViT-B/16); Frequency (F3Net, SPSL, SRM); Spatial (UCF, UnivFD, CORE); Fairness-enhanced (DAW-FDD, DAG-FDD, PG-FDD).

| Dataset | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DFD | Gender | F_{MEO} | 7.052 | 2.139 | 3.857 | 6.07 | 0.269 | 10.037 | 6.095 | 1.039 | 2.257 | 3.605 | 3.059 | 0.717 |
| | | F_{DP} | 5.543 | 5.864 | 1.231 | 4.871 | 7.154 | 2.188 | 5.342 | 2.327 | 7.261 | 6.893 | 8.593 | 6.827 |
| | | F_{OAE} | 3.445 | 3.199 | 5.624 | 2.212 | 0.232 | 4.177 | 2.313 | 5.785 | 2.198 | 2.996 | 2.609 | 1.241 |
| | | F_{EO} | 8.657 | 3.812 | 3.868 | 6.325 | 0.467 | 10.381 | 6.731 | 1.300 | 4.139 | 5.855 | 3.900 | 1.326 |
| | Race | F_{MEO} | 5.975 | 6.844 | 12.306 | 5.574 | 0.319 | 18.91 | 6.141 | 20.641 | 6.292 | 10.597 | 6.021 | 11.641 |
| | | F_{DP} | 25.863 | 19.116 | 11.976 | 25.678 | 40.64 | 21.081 | 28.174 | 14.104 | 26.949 | 28.784 | 28.439 | 29.743 |
| | | F_{OAE} | 6.002 | 10.714 | 16.754 | 4.602 | 0.206 | 8.565 | 3.89 | 15.842 | 4.917 | 4.594 | 4.125 | 3.797 |
| | | F_{EO} | 16.002 | 17.914 | 24.819 | 14.628 | 0.884 | 32.788 | 15.477 | 51.098 | 17.872 | 19.959 | 13.855 | 17.467 |
| | Age | F_{MEO} | 14.485 | 13.629 | 2.744 | 9.24 | 0.9 | 9.38 | 10.383 | 6.69 | 10.768 | 10.107 | 10.942 | 5.6 |
| | | F_{DP} | 34.386 | 18.578 | 10.063 | 32.355 | 20.119 | 18.826 | 33.553 | 4.892 | 34.865 | 32.253 | 34.503 | 27.165 |
| | | F_{OAE} | 11.001 | 18.255 | 22.847 | 6.943 | 0.434 | 13.797 | 7.41 | 23.7 | 5.315 | 7.635 | 6.256 | 6.095 |
| | | F_{EO} | 22.487 | 33.616 | 6.473 | 15.786 | 1.97 | 13.896 | 18.326 | 12.859 | 16.44 | 14.035 | 13.272 | 8.349 |
| | Intersection | F_{MEO} | 15.691 | 37.9 | 20.833 | 13.62 | 1.786 | 27.246 | 11.053 | 35.828 | 18.056 | 20.833 | 9.157 | 12.903 |
| | | F_{DP} | 35.824 | 31.581 | 18.295 | 36.56 | 53.771 | 29.097 | 38.828 | 28.054 | 38.536 | 41.172 | 39.027 | 41.388 |
| | | F_{OAE} | 9.913 | 15.939 | 21.972 | 6.863 | 1.322 | 11.216 | 6.327 | 22.706 | 7.158 | 6.101 | 6.097 | 5.31 |
| | | F_{EO} | 46.408 | 79.825 | 68.93 | 42.743 | 7.073 | 91.155 | 41.325 | 111.273 | 53.592 | 49.779 | 40.678 | 41.822 |
| | | ACC | 93.039 | 88.321 | 83.862 | 94.6 | 99.505 | 91.405 | 94.984 | 80.753 | 94.761 | 92.99 | 94.6 | 97.102 |
| | | AUC | 97.507 | 93.914 | 89.886 | 98.478 | 99.942 | 96.347 | 98.592 | 82.817 | 98.651 | 97.659 | 98.813 | 99.082 |
| | | | 99.349 | 98.366 | 97.059 | 99.596 | 99.965 | 98.929 | 99.614 | 95.008 | 99.62 | 99.375 | 99.687 | 99.75 |
| | | EER | 8.086 | 13.377 | 18.014 | 6.183 | 0.500 | 10.048 | 6.124 | 24.911 | 5.945 | 7.788 | 5.529 | 5.470 |

Table 18: Detailed fairness and utility evaluation results on DFD.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

Celeb-DF-v2

Gender

F_{MEO}

1.764

8.434

10.889

0.584

2.701

2.377

2.645

13.511

2.78

1.312

0.997

1.584

F_{DP}

6.072

7.227

0.405

6.219

8.541

7.023

9.706

6.693

6.071

7.104

6.023

F_{OAE}

1.578

2.063

5.663

1.092

2.149

0.976

0.884

8.053

0.636

2.411

0.831

0.429

F_{EO}

2.585

10.238

11.073

1.108

3.236

3.379

3.564

20.484

3.369

1.599

1.519

1.601

Race

F_{MEO}

5.583

9.879

14.539

7.288

8.943

9.753

4.16

32.306

4.502

21.999

8.275

7.45

F_{DP}

19.474

19.627

14.411

21.812

24.643

16.882

20.953

12.222

21.337

22.694

24.787

16.744

F_{OAE}

6.569

9.664

12.618

4.032

6.035

5.493

3.524

10.82

3.813

3.092

5.392

2.815

F_{EO}

10.691

25.652

28.759

13.013

11.726

15.42

14.684

63.524

9.671

58.225

14.714

12.384

Age

F_{MEO}

7.172

7.331

15.248

6.974

1.948

8.784

3.873

29.904

3.539

5.903

2.508

5.968

F_{DP}

33.004

25.16

6.737

33.891

33.072

33.648

32.236

18.794

32.986

24.932

34.577

32.264

F_{OAE}

1.925

8.576

26.359

1.628

1.532

1.149

2.502

12.526

3.482

10.577

0.845

1.183

F_{EO}

11.497

14.073

19.966

11.657

5.013

11.178

7.669

53.72

9.685

10.027

5.404

7.037

Intersection

F_{MEO}

32.79

57.779

14.286

28.571

16.19

14.286

58.368

16.774

25.477

14.286

12.381

F_{DP}

76.368

78.595

67.795

76.672

77.839

77.371

76.863

67.188

76.881

77.935

75.761

77.349

F_{OAE}

19.231

16.228

49.562

11.538

8.463

7.334

7.692

29.689

11.538

5.769

7.692

5.769

F_{EO}

71.129

114.538

103.126

53.655

59.887

59.694

60.765

182.729

61.653

141.381

48.621

33.495

ACC

97.43

95.129

91.548

98.145

97.511

98.073

98.263

88.191

98.221

96.073

98.405

98.754

AUC

99.345

97.548

96.504

99.652

99.579

99.448

99.684

83.086

99.685

98.377

99.702

99.815

99.908

99.641

99.492

99.953

99.943

99.923

99.957

97.068

99.957

99.763

99.96

99.974

EER

3.733

8.041

9.747

2.189

2.074

2.857

2.051

25.184

2.143

6.382

1.636

2.281

Table 19: Detailed fairness and utility evaluation results on Celeb-DF-v2.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

AttGAN

Gender

F_{MEO}

0.56

1.669

0.472

0.946

0.459

1.554

0.422

4.79

3.544

1.941

0.153

1.47

F_{DP}

10.923

12.249

10.678

11.069

11.068

10.489

11.295

9.998

12.045

11.936

11.171

12.205

F_{OAE}

0.619

0.287

0.739

1.053

0.432

1.136

0.085

3.38

0.75

0.721

0.165

0.288

F_{EO}

1.096

3.05

0.816

1.695

0.617

2.078

0.676

6.049

3.62

2.379

0.177

2.275

Race

F_{MEO}

3.39

3.613

3.228

3.198

3.918

4.013

3.03

18.643

17.655

7.576

1.887

1.695

F_{DP}

11.859

13.88

11.05

12.876

10.628

11.582

13.059

22.615

16.834

13.753

13.054

13.502

F_{OAE}

1.587

2.174

2.526

2.387

2.033

2.31

2.521

4.713

5.636

5.042

1.6

F_{EO}

5.016

10.539

9.994

5.975

9.472

9.239

5.592

37.89

19.269

12.917

4.291

5.117

Age

F_{MEO}

3.086

4.899

7.144

1.194

3.704

3.096

2.469

15.211

5.996

2.855

2.206

5.493

F_{DP}

22.439

21.386

18.175

23.439

22.14

22.789

24.491

20.473

21.175

22.14

24.789

21.193

F_{OAE}

1.105

3.595

4.689

0.942

2.563

3.132

2.456

6.436

4.493

2.309

0.398

3.758

F_{EO}

5.209

10.255

13.136

4.103

10.371

10.312

5.807

36.092

8.639

6.932

3.746

7.407

Intersection

F_{MEO}

5.128

11.111

7.692

7.407

6.667

7.407

31.774

33.333

7.692

3.125

F_{DP}

20.594

24.253

20.152

21.106

20.783

19.375

22.003

28.514

28.753

21.677

22.003

23.411

F_{OAE}

4.225

7.042

4.968

5.634

3.177

4.878

4.348

16.17

10.976

5.479

2.817

1.852

F_{EO}

21.471

42.762

35.107

23.6

25.943

31.389

22.546

92.264

46.215

33.053

12.897

17.18

ACC

98.482

97.884

95.86

98.62

98.482

98.666

98.712

80.957

96.274

97.608

99.264

99.126

AUC

99.798

99.526

99.259

99.776

99.702

99.642

99.875

89.719

98.721

99.722

99.781

99.953

99.795

99.492

99.282

99.797

99.612

99.587

99.888

91.76

98.646

99.732

99.827

99.958

EER

1.594

2.092

4.084

1.494

1.394

1.195

18.426

4.98

2.39

0.996

1.494

Table 20: Detailed fairness and utility evaluation results on AttGAN.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

MMDGAN

Gender

F_{MEO}

2.722

17.144

0.773

1.087

3.809

3.261

4.622

1.626

3.394

8.439

11.417

2.448

F_{DP}

8.077

16.007

6.925

7.801

8.077

7.939

9.604

6.787

9.584

11.082

13.928

9.141

F_{OAE}

2.335

7.348

1.084

1.271

3.398

3.26

2.797

0.63

0.512

4.275

4.994

1.133

F_{EO}

4.772

18.286

0.773

2.028

6.9

6.352

5.63

3.071

4.481

9.447

12.493

2.448

Race

F_{MEO}

73.95

8.974

16.667

14.286

8.333

33.333

16.667

28.571

2.21

F_{DP}

F_{OAE}

9.091

5.233

9.091

4.545

18.182

9.091

1.187

F_{EO}

23.462

91.198

13.141

22.352

28.661

15.508

19.122

48.478

25.201

39.585

44.593

5.931

Age

F_{MEO}

11.706

22.297

4.808

10.345

7.642

6.924

9.091

14.336

14.559

9.091

F_{DP}

17.703

20.303

10.909

9.394

11.515

11.212

11.818

10.303

12.727

8.788

12.727

11.818

F_{OAE}

3.939

10.606

2.424

5.455

9.091

6.515

4.127

3.828

2.233

6.89

8.254

5.263

F_{EO}

23.368

39.483

6.422

13.465

22.203

25.61

17.407

13.78

17.996

21.652

24.926

10.142

Intersection

F_{MEO}

12.5

100

22.222

16.667

11.111

44.444

22.222

100

33.333

4.167

F_{DP}

58.088

62.5

51.471

58.088

70.588

58.088

F_{OAE}

11.765

41.667

12.5

11.765

8.333

5.882

23.529

11.765

12.5

16.667

1.948

F_{EO}

51.536

230.67

59.507

44.743

59.982

39.11

42.671

103.142

55.926

161.146

85.72

14.412

ACC

97.525

90.099

95.792

98.02

97.03

98.02

97.772

93.812

96.535

94.307

96.287

99.01

AUC

99.395

97.839

99.299

99.687

99.508

99.392

99.781

97.987

98.521

99.515

99.808

99.98

98.918

97.589

99.226

99.691

99.461

99.215

99.792

97.182

98.25

99.525

99.812

99.983

EER

2.646

7.407

3.704

1.587

2.646

2.116

6.349

4.233

1.587

0.529

Table 21: Detailed fairness and utility evaluation results on MMDGAN.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

StarGAN

Gender

F_{MEO}

0.281

2.181

0.472

0.33

0.388

0.552

0.044

2.075

1.737

0.344

0.627

0.33

F_{DP}

4.379

5.639

3.887

4.547

4.555

4.744

4.461

5.32

5.209

4.678

4.954

4.65

F_{OAE}

0.181

0.604

0.435

0.3

0.275

0.332

0.008

0.095

0.214

0.07

0.107

0.197

F_{EO}

0.408

2.679

0.59

0.486

0.561

0.619

0.049

2.74

2.18

0.606

0.957

0.375

Race

F_{MEO}

4.577

11.031

22.727

11.197

6.113

2.062

F_{DP}

27.875

27.493

29.459

28.39

29.39

26.875

29.086

35.768

30.974

27.403

27.875

26.056

F_{OAE}

1.515

3.036

4.571

1.515

3.03

6.682

2.931

1.395

1.515

1.325

F_{EO}

7.666

11.827

17.243

8.507

10.863

8.762

10.006

29.26

17.857

12.09

6.291

4.109

Age

F_{MEO}

2.479

3.577

5.091

3.167

1.667

2.5

1.379

1.167

4.562

2.033

1.667

0.943

F_{DP}

19.078

17.476

16.434

19.802

19.399

19.078

19.319

17.244

18.927

17.39

19.078

19.158

F_{OAE}

1.132

2.264

2.119

0.323

1.201

2.264

1.132

2.075

0.843

0.908

1.509

0.601

F_{EO}

4.659

6.124

9.058

5.728

4.038

5.801

2.946

2.421

6.082

4.539

3.86

2.745

Intersection

F_{MEO}

14.286

12.5

21.774

6.25

11.111

6.25

14.286

18.75

11.111

5.556

2.381

F_{DP}

30.971

32.154

36.599

31.973

33.612

29.932

31.571

38.639

36.417

31.791

31.571

28.326

F_{OAE}

5.882

7.418

2.222

4.082

5.882

8.889

5.462

4.082

2.041

1.471

F_{EO}

22.432

38.089

41.688

20.567

19.704

19.605

25.484

59.844

36.081

26.652

11.756

8.426

ACC

99.326

98.289

96.216

99.015

99.274

99.378

99.43

94.66

96.319

98.237

99.482

99.533

AUC

99.874

99.773

99.556

99.909

99.86

99.869

99.964

99.626

99.076

99.796

99.909

99.983

99.899

99.797

99.56

99.933

99.826

99.832

99.97

99.724

99.079

99.809

99.929

99.986

EER

0.795

1.135

2.611

0.454

0.795

0.681

0.568

1.93

3.973

1.589

0.568

0.454

Table 22: Detailed fairness and utility evaluation results on StarGAN.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

StyleGAN

Gender

F_{MEO}

0.436

2.543

1.046

1.136

1.558

0.561

0.447

3.248

3.136

0.44

0.789

0.136

F_{DP}

20.675

21.927

20.009

20.923

21.864

20.013

20.847

20.617

21.168

20.521

21.311

21.15

F_{OAE}

0.17

0.208

0.869

0.551

0.344

0.496

0.205

1.003

0.187

0.029

0.404

0.027

F_{EO}

0.533

3.444

1.191

1.686

2.035

0.926

0.454

3.747

3.316

0.73

1.009

0.205

Race

F_{MEO}

4.078

13.916

11.459

4.498

3.659

11.815

2.439

20.18

18.22

2.941

1.22

2.105

F_{DP}

25.95

25.845

24.097

24.671

24.566

27.593

24.207

27.273

25.481

23.849

24.053

24.257

F_{OAE}

1.149

5.309

4.74

1.905

1.693

4.72

0.607

8.377

7.544

0.892

0.635

1.075

F_{EO}

7.696

21.343

19.263

8.82

6.977

15.593

4.642

30.007

26.199

5.893

3.345

3.163

Age

F_{MEO}

9.065

19.373

11.673

18.17

19.059

17.073

1.491

9.843

7.494

9.065

9.53

1.556

F_{DP}

49.291

52.488

44.166

48.832

50.785

49.475

47.41

40.723

44.455

49.356

49.553

48.55

F_{OAE}

0.783

2.425

14.36

2.943

2.104

1.163

1.068

10.291

8.482

0.908

1.64

1.333

F_{EO}

15.085

35.787

23.787

24.836

29.459

28.836

3.87

22.027

11.175

15.268

19.652

2.362

Intersection

F_{MEO}

7.407

17.306

16.78

7.143

17.857

3.704

24.774

26.04

4.054

3.571

2.817

F_{DP}

47.301

51.383

47.73

47.13

50.191

48.999

47.301

50.62

50.534

47.215

48.236

48.322

F_{OAE}

4.545

6.281

6.404

4.082

2.273

6.818

2.273

8.961

9.494

3.061

2.041

1.105

F_{EO}

20.331

50.793

39.145

22.806

19.225

41.872

11.758

53.806

50.322

16.809

12.616

7.417

ACC

98.975

97.819

96.347

98.476

99.08

97.976

99.527

94.77

96.399

99.054

99.448

99.685

AUC

99.925

99.794

99.392

99.861

99.964

99.902

99.985

99.703

99.51

99.892

99.986

99.979

99.94

99.854

99.386

99.904

99.97

99.916

99.988

99.756

99.316

99.925

99.989

99.981

EER

0.982

1.443

3.753

1.501

0.693

1.27

0.52

2.887

2.483

0.982

0.52

0.462

Table 23: Detailed fairness and utility evaluation results on StyleGAN.

Model types: Naive (Xception, EfficientB4, ViT-B/16); Frequency (F3Net, SPSL, SRM); Spatial (UCF, UnivFD, CORE); Fairness-enhanced (DAW-FDD, DAG-FDD, PG-FDD).

| Dataset | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| StyleGAN2 | Gender | F_{MEO} | 0.847 | 0.56 | 1.556 | 0.976 | 0.487 | 0.666 | 0.27 | 0.447 | 1.073 | 0.534 | 0.241 | 0.045 |
| | | F_{DP} | 3.512 | 2.482 | 3.231 | 3.434 | 2.926 | 2.668 | 2.988 | 2.775 | 2.129 | 2.984 | 2.961 | 2.871 |
| | | F_{OAE} | 0.077 | 0.341 | 0.624 | 0.246 | 0.342 | 0.636 | 0.092 | 0.117 | 0.698 | 0.31 | 0.092 | 0.022 |
| | | F_{EO} | 1.538 | 0.594 | 1.7 | 1.338 | 0.686 | 1.192 | 0.317 | 0.482 | 1.113 | 0.56 | 0.263 | 0.047 |
| | Race | F_{MEO} | 1.037 | 5.147 | 6.385 | 1.057 | 1.401 | 6.072 | 1.244 | 15.519 | 16.197 | 2.565 | 0.926 | 0.517 |
| | | F_{DP} | 33.803 | 35.296 | 33.103 | 34.638 | 35.076 | 36.619 | 35.925 | 38.228 | 38.381 | 34.674 | 35.711 | 35.522 |
| | | F_{OAE} | 1.451 | 1.354 | 1.4 | 0.471 | 0.506 | 1.907 | 0.229 | 2.583 | 2.74 | 1.331 | 0.38 | 0.247 |
| | | F_{EO} | 2.251 | 7.675 | 13.968 | 2.428 | 2.762 | 7.489 | 2.377 | 17.827 | 19.998 | 5.173 | 2.369 | 1.48 |
| | Age | F_{MEO} | 2.766 | 2.561 | 8.543 | 3.408 | 3.328 | 2.493 | 2.486 | 9.408 | 9.634 | 6.514 | 2.532 | 0.607 |
| | | F_{DP} | 16.251 | 16.177 | 16.677 | 16.418 | 16.74 | 15.95 | 16.669 | 16.91 | 18.079 | 12.621 | 16.323 | 15.762 |
| | | F_{OAE} | 1.016 | 0.375 | 2.35 | 1.05 | 1.265 | 2.008 | 0.647 | 1.74 | 2.042 | 5.433 | 0.97 | 0.528 |
| | | F_{EO} | 5.353 | 2.926 | 13.051 | 5.249 | 4.883 | 6.547 | 3.779 | 12.811 | 10.052 | 11.092 | 3.689 | 1.966 |
| | Intersection | F_{MEO} | 2.436 | 5.448 | 9.55 | 2.127 | 2.384 | 7.475 | 1.468 | 18.286 | 20.753 | 3.132 | 1.511 | 0.695 |
| | | F_{DP} | 37.77 | 39.222 | 35.56 | 38.81 | 38.411 | 39.991 | 39.446 | 42.732 | 42.186 | 38.965 | 39.128 | 38.732 |
| | | F_{OAE} | 1.896 | 2.822 | 2.726 | 0.658 | 1.073 | 2.369 | 0.643 | 3.644 | 4.862 | 1.795 | 0.575 | 0.488 |
| | | F_{EO} | 9.7 | 18.646 | 26.948 | 7.826 | 7.696 | 17.235 | 5.947 | 35.527 | 41.168 | 13.195 | 5.737 | 3.074 |
| | | ACC | 97.46 | 98.044 | 95.299 | 98.472 | 98.799 | 98.331 | 99.311 | 94.745 | 96.23 | 97.207 | 99.32 | 99.479 |
| | | AUC | 99.738 | 99.698 | 98.85 | 99.794 | 99.816 | 99.741 | 99.877 | 99.205 | 99.209 | 99.713 | 99.883 | 99.968 |
| | | | 99.787 | 99.656 | 98.871 | 99.819 | 99.794 | 99.715 | 99.861 | 99.205 | 98.979 | 99.759 | 99.901 | 99.97 |
| | | EER | 2.161 | 1.066 | 5.234 | 1.542 | 1.17 | 1.309 | 0.704 | 3.906 | 3.019 | 2.374 | 0.699 | 0.535 |

Table 24: Detailed fairness and utility evaluation results on StyleGAN2.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

StyleGAN3

Gender

F_{MEO}

0.645

1.688

1.345

0.399

0.177

0.132

0.339

1.194

1.626

1.868

0.378

0.23

F_{DP}

5.585

6.032

4.595

5.195

5.327

5.274

5.532

5.417

5.837

6.14

5.536

5.482

F_{OAE}

0.374

0.558

1.26

0.235

0.113

0.031

0.174

0.062

0.221

0.886

0.173

0.086

F_{EO}

0.774

1.792

1.843

0.56

0.254

0.219

0.382

1.305

1.842

1.952

0.421

0.264

Race

F_{MEO}

1.701

8.361

11.792

0.605

0.498

1.546

0.893

19.275

17.174

2.079

0.708

3.61

F_{DP}

41.75

42.909

41.384

42.642

43.108

42.823

43.681

45.108

45.534

42.603

43.073

44.332

F_{OAE}

0.259

1.66

2.29

0.514

0.47

0.436

0.612

2.527

2.103

0.757

0.335

0.761

F_{EO}

4.543

13.614

18.916

1.177

1.092

2.955

1.47

21.59

24.212

4.237

2.164

4.885

Age

F_{MEO}

1.403

2.138

10.825

1.727

0.743

0.459

2.1

11.432

14.27

3.777

1.792

0.892

F_{DP}

14.913

14.783

17.612

14.285

14.734

14.04

17.735

19.446

15.206

13.967

15.198

F_{OAE}

0.782

0.986

4.387

1.1

0.612

0.465

1.408

3.895

5.177

1.775

0.984

0.31

F_{EO}

3.378

4.108

14.744

3.685

1.714

1.141

3.498

15.191

16.05

5.888

3.075

1.073

Intersection

F_{MEO}

2.439

10.814

14.81

1.096

1.429

2.381

1.429

24.377

22.722

3.681

2.439

4.138

F_{DP}

50.071

51.956

50.376

50.55

51.369

51.129

52.043

53.357

55.475

51.27

51.514

52.961

F_{OAE}

0.893

2.808

3.841

0.644

1.013

0.526

1.124

6.702

5.604

2.306

2.143

1.429

F_{EO}

11.913

30.289

40.686

3.519

4.723

6.395

5.967

44.341

52.555

14.502

7.765

12.542

ACC

98.696

98.009

95.771

98.645

99.364

99.548

99.374

94.703

96.12

97.444

99.199

99.672

AUC

99.86

99.613

99.263

99.863

99.923

99.906

99.941

99.04

98.621

99.749

99.929

99.996

99.906

99.568

99.302

99.906

99.951

99.9

99.961

99.172

98.577

99.814

99.956

99.996

EER

1.373

1.733

4.142

1.351

0.675

0.45

0.72

4.66

5.088

2.139

0.653

0.36

Table 25: Detailed fairness and utility evaluation results on StyleGAN3.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

MSG-StyleGAN

Gender

F_{MEO}

1.136

12.078

0.703

3.409

0.654

2.219

0.654

14.301

7.359

0.614

0.039

F_{DP}

27.6

31.71

25.179

26.151

27.903

26.151

27.903

15.658

29.646

27.059

27.784

28.325

F_{OAE}

0.725

3.146

2.174

0.422

1.33

0.422

8.063

1.321

0.66

0.541

F_{EO}

1.136

13.323

0.703

3.409

0.654

2.872

0.654

15.978

7.359

0.668

0.039

Race

F_{MEO}

0.709

12.5

18.762

9.091

0.515

17.473

0.515

18.182

13.217

2.577

12.5

F_{DP}

49.876

46.294

36.493

50.174

49.279

32.91

49.279

41.228

43.333

48.682

49.577

F_{OAE}

0.299

10.526

13.085

5.263

0.299

16.07

0.299

7.456

8.437

2.09

5.263

F_{EO}

1.291

22.112

27.06

9.417

1.008

25.246

1.008

50.251

14.169

7.622

12.924

Age

F_{MEO}

2.857

37.594

5.714

2.894

0.526

16.667

0.526

33.333

12.381

3.846

15.614

F_{DP}

47.334

49.913

48.962

47.673

49.434

50.451

49.434

42.249

51.74

48.417

51.534

49.773

F_{OAE}

2.439

6.652

1.993

2.691

0.339

4.2

0.339

4.807

4.407

3.03

2.352

F_{EO}

3.439

51.965

7.906

4.007

1.019

19.632

1.019

37.676

17.663

9.151

27.929

Intersection

F_{MEO}

1.493

41.892

11.111

0.667

2.667

F_{DP}

55.853

54.067

57.778

55.853

55.407

33.631

55.407

68.889

54.514

55.853

F_{OAE}

0.901

14.286

17.321

7.143

0.446

22.222

0.446

2.232

7.143

F_{EO}

3.237

54.364

57.775

15.84

2.144

39.23

2.144

123.31

59.65

11.79

23.97

ACC

99.733

95.467

99.2

99.733

98.667

99.733

86.933

96.267

98.133

98.933

100

AUC

99.997

98.943

99.834

99.994

100

99.928

100

94.53

96.669

99.908

100

99.998

97.156

99.863

99.995

100

99.939

100

93.162

95.249

99.922

100

EER

0.581

4.651

2.326

1.163

11.628

6.395

1.163

Table 26: Detailed fairness and utility evaluation results on MSG-StyleGAN.

Model types: Naive (Xception, EfficientB4, ViT-B/16); Frequency (F3Net, SPSL, SRM); Spatial (UCF, UnivFD, CORE); Fairness-enhanced (DAW-FDD, DAG-FDD, PG-FDD).

| Dataset | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ProGAN | Gender | F_{MEO} | 0.381 | 1.296 | 1.134 | 0.428 | 0.315 | 0.403 | 0.243 | 0.834 | 2.333 | 0.236 | 0.242 | 0.138 |
| | | F_{DP} | 16.706 | 17.133 | 15.201 | 16.709 | 16.918 | 16.671 | 17.043 | 15.652 | 16.838 | 16.708 | 17.082 | 16.882 |
| | | F_{OAE} | 0.341 | 0.006 | 1.793 | 0.396 | 0.305 | 0.187 | 0.131 | 1.442 | 0.227 | 0.236 | 0.081 | 0.19 |
| | | F_{EO} | 0.504 | 1.501 | 1.137 | 0.610 | 0.613 | 0.437 | 0.257 | 0.947 | 2.389 | 0.304 | 0.271 | 0.167 |
| | Race | F_{MEO} | 3.598 | 5.357 | 9.424 | 2.721 | 3.565 | 4.177 | 2.743 | 18.043 | 20.491 | 3.027 | 3.149 | 0.822 |
| | | F_{DP} | 35.504 | 30.506 | 25.824 | 35.179 | 36.143 | 35.551 | 35.743 | 22.053 | 18.852 | 35.97 | 35.661 | 34.624 |
| | | F_{OAE} | 1.285 | 5.514 | 10.036 | 1.232 | 0.609 | 1.279 | 0.693 | 15.502 | 17.141 | 0.48 | 0.926 | 0.844 |
| | | F_{EO} | 5.759 | 9.542 | 14.448 | 4.235 | 4.912 | 5.619 | 3.788 | 22.113 | 25.053 | 4.519 | 4.381 | 1.968 |
| | Age | F_{MEO} | 1.374 | 1.953 | 8.853 | 0.898 | 0.798 | 0.932 | 0.656 | 6.036 | 5.397 | 1.262 | 0.656 | 1.244 |
| | | F_{DP} | 21.22 | 22.383 | 19.714 | 21.441 | 21.587 | 21.342 | 21.717 | 19.868 | 21.597 | 21.319 | 21.817 | 21.601 |
| | | F_{OAE} | 0.897 | 0.912 | 5.411 | 0.804 | 0.583 | 0.726 | 0.503 | 4.49 | 2.83 | 1.067 | 0.367 | 0.702 |
| | | F_{EO} | 2.875 | 3.954 | 11.106 | 2.453 | 2.53 | 2.744 | 1.763 | 10.015 | 6.825 | 3.491 | 1.646 | 1.582 |
| | Intersection | F_{MEO} | 4.284 | 6.829 | 10.793 | 6.557 | 6.581 | 8.513 | 6.604 | 20.936 | 24.845 | 6.406 | 6.581 | 0.935 |
| | | F_{DP} | 50.437 | 47.389 | 40.919 | 50.462 | 51.762 | 51.533 | 52.347 | 38.606 | 36.209 | 51.858 | 51.988 | 50.523 |
| | | F_{OAE} | 1.882 | 5.836 | 11.532 | 2.432 | 1.162 | 1.529 | 0.818 | 16.738 | 19.088 | 0.956 | 0.993 | 0.961 |
| | | F_{EO} | 13.14 | 22.712 | 31.013 | 13.286 | 13.577 | 16.796 | 11.759 | 42.997 | 52.525 | 13.479 | 12.476 | 3.729 |
| | | ACC | 99.357 | 98.286 | 96.458 | 99.344 | 99.558 | 99.384 | 99.639 | 95.045 | 96.418 | 99.243 | 99.688 | 99.68 |
| | | AUC | 99.968 | 99.899 | 99.84 | 99.938 | 99.961 | 99.948 | 99.977 | 99.895 | 99.105 | 99.954 | 99.984 | 99.996 |
| | | | 99.976 | 99.928 | 99.861 | 99.959 | 99.974 | 99.959 | 99.984 | 99.927 | 98.838 | 99.966 | 99.988 | 99.997 |
| | | EER | 0.535 | 0.916 | 1.838 | 0.547 | 0.44 | 0.595 | 0.363 | 0.69 | 3.094 | 0.696 | 0.345 | 0.321 |

Table 27: Detailed fairness and utility evaluation results on ProGAN.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

STGAN

Gender

F_{MEO}

17.404

18.772

4.36

13.333

12.912

5.087

6.737

12.632

7.965

12.596

14.737

5.333

F_{DP}

16.581

5.161

12.984

13.419

7.823

11.774

0.097

14.984

12.935

15.645

11.194

F_{OAE}

7.968

10.419

2.452

8.532

7.387

4.661

2.452

12.774

0.048

5.903

6.323

2.581

F_{EO}

17.404

21.708

5.238

17.171

15.412

9.087

7.812

19.085

15.123

13.934

15.812

5.333

Race

F_{MEO}

13.299

4.167

11.765

17.647

4.412

17.647

26.961

5.882

F_{DP}

32.197

18.561

20.613

19.048

19.697

22.811

13.258

29.337

13.62

18.215

20.613

F_{OAE}

8.97

4.5

2.464

3.297

36.742

10.023

19.833

7.955

F_{EO}

18.036

47.951

6.249

27.305

24.367

17.005

33.472

83.525

53.832

35.574

17.67

8.072

Age

F_{MEO}

18.277

28.125

22.581

23.333

8.696

22.5

14.146

19.916

20.784

F_{DP}

10.656

11.688

10.343

14.52

19.438

10.82

16.159

10.134

8.765

13.574

14.844

12.881

F_{OAE}

10.99

11.475

13.115

11.475

5.455

11.475

14.637

8.67

12.404

9.115

4.918

F_{EO}

26.439

43.956

36.21

42.739

41.293

15.487

29.446

43.298

35.301

48.478

32.922

13.125

Intersection

F_{MEO}

28.571

30.612

16.667

16.327

23.077

7.812

66.667

38.462

24.49

7.692

F_{DP}

40.833

39.167

35.177

32.78

36.947

33.333

47.677

23.82

33.038

F_{OAE}

16.667

22.222

7.143

10.619

15.789

7.08

16.667

48.333

20.833

26.316

16.667

5.263

F_{EO}

73.153

131.743

58.478

75.809

70.579

43.757

85.001

236.087

150.768

101.162

90.191

22.28

ACC

93.521

88.451

94.93

95.775

97.465

94.93

69.577

93.521

90.423

93.239

98.873

AUC

99.573

96.465

98.534

99.541

99.547

97.807

99.538

85.921

97.335

99.194

99.522

99.908

99.639

95.872

98.139

99.59

99.607

97.132

99.579

82.016

96.354

99.242

99.554

99.922

EER

4.217

7.831

3.614

19.277

6.024

3.614

Table 28: Detailed fairness and utility evaluation results on STGAN.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

VQGAN

Gender

F_{MEO}

0.359

1.689

0.682

0.732

0.368

0.47

0.241

1.789

1.021

0.642

0.338

0.207

F_{DP}

12.221

13.117

11.324

12.105

12.491

12.466

12.581

9.224

12.203

12.216

12.609

12.755

F_{OAE}

0.231

0.067

1.264

0.49

0.377

0.454

0.187

0.12

0.625

0.482

0.243

0.069

F_{EO}

0.371

2.198

0.722

0.829

0.719

0.876

0.367

2.226

1.274

0.853

0.513

0.386

Race

F_{MEO}

1.267

8.893

10.064

2.257

3.158

2.321

3.273

20.147

20.549

4.11

3.183

1.429

F_{DP}

59.881

60.677

57.134

60.756

60.89

60.225

61.269

54.127

62.9

60.342

61.099

61.113

F_{OAE}

0.692

1.985

3.926

0.611

0.47

0.855

0.58

9.087

2.245

0.726

0.518

0.283

F_{EO}

3.139

13.952

15.64

3.751

4.581

4.174

4.947

34.162

25.23

6.323

4.388

2.143

Age

F_{MEO}

1.104

1.105

9.03

1.813

1.533

0.648

5.925

8.977

1.61

0.703

0.897

F_{DP}

29.715

29.264

30.54

29.734

30.239

30.338

30.415

22.037

31.068

30.059

30.343

30.537

F_{OAE}

0.956

0.953

2.392

0.984

0.785

0.866

0.339

4.197

3.12

0.992

0.444

0.461

F_{EO}

2.363

4.055

11.65

3.145

2.892

3.001

1.344

18.036

10.529

3.499

1.421

1.653

Intersection

F_{MEO}

3.846

13.893

12.515

3.44

3.504

3.112

3.671

23.965

25.638

4.721

3.525

F_{DP}

67.678

69.217

64.056

68.414

68.785

67.971

69.072

60.076

71.221

68.725

69.149

69.359

F_{OAE}

1.703

3.567

5.313

0.889

1.029

1.408

0.787

10.854

3.869

2.36

1.186

0.976

F_{EO}

9.427

27.715

31.838

9.934

10.748

11.46

11.34

66.065

46.268

16.097

10.787

5.751

ACC

99.092

97.936

96.313

99.102

99.344

99.387

99.543

91.217

96.248

99.135

99.538

99.758

AUC

99.909

99.746

99.699

99.883

99.878

99.879

99.912

96.257

98.872

99.912

99.938

99.99

99.926

99.755

99.716

99.901

99.871

99.855

99.895

96.027

98.822

99.932

99.952

99.991

EER

0.835

1.565

2.554

0.706

0.683

0.588

0.447

9.508

4.46

0.812

0.447

0.306

Table 29: Detailed fairness and utility evaluation results on VQGAN.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

Commercial Tools

Gender

F_{MEO}

7.689

8.221

4.167

4.432

2.879

0.435

3.258

8.864

2.708

2.348

4.432

2.083

F_{DP}

18.977

22.798

18.901

18.527

16.616

16.540

18.714

20.701

16.540

16.990

17.440

17.177

F_{OAE}

5.961

1.613

3.337

2.326

1.350

2.250

5.137

3.524

2.700

4.424

2.887

F_{EO}

10.397

14.471

5.714

4.867

3.459

0.435

4.950

10.411

4.791

2.929

7.140

2.664

Race

F_{MEO}

33.333

14.286

33.333

28.571

33.333

28.571

33.333

F_{DP}

69.398

61.706

71.572

69.398

65.050

73.746

69.398

65.050

73.746

71.572

74.089

F_{OAE}

21.053

7.018

10.526

15.789

4.348

8.696

13.043

10.526

15.789

5.263

F_{EO}

77.098

29.869

62.147

52.682

73.558

36.044

48.191

50.286

63.706

70.345

70.475

57.018

Age

F_{MEO}

28.571

14.286

28.571

6.711

12.500

14.286

5.833

F_{DP}

49.346

58.824

52.778

50.817

47.876

52.288

51.797

49.346

52.288

51.307

50.327

52.288

F_{OAE}

9.225

10.205

8.170

6.566

11.111

8.170

6.566

9.641

7.680

8.660

F_{EO}

37.679

36.583

21.513

43.361

20.392

13.639

27.762

28.094

22.990

22.156

22.837

8.656

Intersection

F_{MEO}

50.000

25.000

50.000

33.333

50.000

33.333

50.000

F_{DP}

65.714

62.637

68.889

68.000

72.000

68.000

65.714

72.000

74.444

F_{OAE}

22.222

8.547

11.111

19.048

5.128

9.524

20.000

11.111

16.667

7.692

F_{EO}

131.495

67.938

105.685

81.793

129.065

59.940

76.252

91.701

105.535

114.665

126.561

109.221

ACC

93.976

95.582

92.771

97.590

95.984

92.369

96.787

95.181

96.386

AUC

95.778

99.541

99.005

96.349

94.798

95.716

95.681

96.808

97.371

97.141

94.812

93.365

96.193

99.751

99.401

96.966

95.607

96.184

93.761

98.000

98.153

97.779

94.066

90.493

EER

7.692

3.297

5.495

6.593

8.791

6.593

9.890

6.593

7.692

Table 30: Detailed fairness and utility evaluation results on Commercial Tools (DALLE2, IF & Midjourney).

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

DCFace

Gender

F_{MEO}

0.525

0.87

1.201

0.637

0.196

0.338

0.052

0.596

1.77

0.368

0.066

0.04

F_{DP}

5.465

4.56

5.28

5.534

5.342

5.371

5.274

4.013

4.012

5.351

5.265

5.213

F_{OAE}

0.191

0.412

0.311

0.214

0.062

0.163

0.015

0.191

1.155

0.161

0.048

0.018

F_{EO}

0.549

0.873

1.393

0.704

0.231

0.343

0.066

0.944

1.86

0.415

0.085

0.076

Race

F_{MEO}

0.667

5.55

8.151

0.608

0.737

0.591

0.663

16.794

18.582

0.78

0.669

0.938

F_{DP}

18.219

18.877

20.375

18.181

18.285

18.201

18.441

26.177

25.501

18.289

18.292

18.522

F_{OAE}

0.535

2.088

3.419

0.384

0.294

0.359

0.431

5.528

6.893

0.322

0.304

0.392

F_{EO}

2.333

10.815

14.05

1.889

1.94

1.649

2.154

29.148

23.112

1.838

1.661

1.454

Age

F_{MEO}

1.448

2.989

9.273

1.071

0.567

1.055

0.425

6.914

7.59

1.253

0.706

0.594

F_{DP}

12.708

13.621

9.357

12.926

13.119

12.754

13.08

10.929

8.445

12.763

13.314

12.741

F_{OAE}

0.918

1.218

4.94

0.772

0.501

0.831

0.358

6.765

4.839

0.973

0.4

0.388

F_{EO}

2.782

6.577

12.53

2.419

1.552

2.202

1.108

15.928

8.071

2.468

1.232

0.924

Intersection

F_{MEO}

1.327

7.377

9.272

1.463

0.892

0.923

0.866

19.337

22.619

0.886

0.868

1.29

F_{DP}

21.136

20.833

22.531

21.04

20.964

20.906

21.006

27.504

28.454

21.116

20.984

21.006

F_{OAE}

0.764

3.089

4.017

0.649

0.362

0.518

0.498

6.588

8.981

0.548

0.391

0.568

F_{EO}

5.666

22.247

27.649

6.712

4.452

4.176

4.367

58.647

45.831

5.043

3.709

2.873

ACC

99.361

96.935

96.038

99.314

99.542

99.395

99.627

92.834

96.443

99.329

99.654

99.727

AUC

99.961

99.513

99.718

99.938

99.934

99.956

99.965

97.415

99.129

99.956

99.965

99.994

99.972

99.602

99.776

99.955

99.947

99.963

99.97

97.347

98.913

99.969

99.977

99.995

EER

0.422

3.07

2.649

0.414

0.612

0.363

7.661

3.26

0.515

0.368

0.322

Table 31: Detailed fairness and utility evaluation results on DCFace.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

Latent Diffusion

Gender

F_{MEO}

0.711

1.315

6.822

0.587

0.269

0.404

0.343

0.523

1.341

0.709

0.343

0.005

F_{DP}

6.704

7.776

1.172

6.844

7.154

7.227

7.274

6.063

7.26

6.444

7.367

7.15

F_{OAE}

0.339

0.192

3.448

0.376

0.232

0.315

0.206

0.052

0.059

0.069

0.113

0.02

F_{EO}

0.723

1.748

8.477

0.706

0.467

0.658

0.411

0.993

1.39

1.186

0.464

0.005

Race

F_{MEO}

1.377

7.837

10.67

1.602

0.319

0.559

1.116

19.547

20.291

0.763

0.633

0.503

F_{DP}

39.89

41.058

30.225

39.89

40.64

40.819

40.462

39.759

42.918

40.116

40.95

40.593

F_{OAE}

1.202

1.262

9.842

0.921

0.206

0.387

0.658

4.85

2.278

0.691

0.299

0.31

F_{EO}

2.443

10.604

26.477

3.081

0.884

1.541

2.659

30.757

21.909

2.097

1.363

0.765

Age

F_{MEO}

2.771

2.325

12.515

1.798

0.9

0.803

0.762

3.544

4.117

1.896

0.571

1.604

F_{DP}

20.503

20.83

20.755

20.183

20.119

19.742

19.955

15.919

20.598

20.183

20.062

20.823

F_{OAE}

0.913

0.495

5.504

0.508

0.434

0.505

0.275

2.119

0.437

1.088

0.319

0.518

F_{EO}

4.881

4.303

35.584

3.991

1.97

1.782

1.411

11.437

6.184

4.923

1.425

2.165

Intersection

F_{MEO}

3.571

10.805

13.52

3.846

1.786

22.411

24.881

3.226

1.786

0.714

F_{DP}

52.751

53.892

39.806

52.751

53.771

54.281

53.441

50.799

55.935

52.87

54.461

53.621

F_{OAE}

3.061

2.274

13.761

2.643

1.322

1.531

5.062

2.708

1.442

0.541

0.51

F_{EO}

10.619

23.761

61.046

11.037

7.073

5.892

8.286

57.444

45.514

11.596

5.021

1.768

ACC

99.066

98.528

88.706

99.179

99.505

99.674

99.646

92.669

96.519

98.981

99.689

99.887

AUC

99.921

99.948

96.795

99.908

99.942

99.968

99.972

97.153

98.926

99.916

99.971

99.999

99.945

99.961

96.469

99.931

99.965

99.976

99.983

96.901

98.668

99.94

99.981

99.999

EER

0.906

0.531

9.031

0.688

0.5

0.406

0.375

3.719

0.469

0.156

Table 32: Detailed fairness and utility evaluation results on Latent Diffusion.

Model Type

Naive

Frequency

Spatial

Fairness-enhanced

Dataset

Attribute

Metric

Xception

[61]

EfficientB4

[62]

ViT-B/16

[63]

F3Net

[64]

SPSL

[65]

SRM

[66]

UCF

[16]

UnivFD

[67]

CORE

[68]

DAW-FDD

[20]

DAG-FDD

[20]

PG-FDD

[21]

Palette

Gender

F_{MEO}

1.164

0.544

1.2

1.164

1.121

2.159

1.611

9.265

0.757

1.348

0.503

0.727

F_{DP}

13.196

13.889

12.155

13.052

13.571

13.763

13.705

2.952

13.466

12.548

13.542

13.928

F_{OAE}

1.164

0.278

2.166

1.02

0.848

1.308

1.077

9.412

0.95

0.803

0.55

0.147

F_{EO}

1.999

0.866

1.844

1.45

1.814

3.043

2.412

10.763

1.438

1.536

0.884

0.791

Race

F_{MEO}

3.659

5.947

11.668

3.333

6.098

7.317

6.098

20.528

20.242

5.108

2.83

7.317

F_{DP}

5.979

6.385

4.834

7.61

5.135

4.144

5.922

16.802

8.742

5.776

7.261

5.922

F_{OAE}

1.965

4.123

8.436

2.97

2.963

2.046

2.329

6.372

13.877

2.402

2.062

3.319

F_{EO}

8.157

11.825

18.769

7.416

9.477

14.029

10.363

43.621

26.601

12.368

6.161

10.827

Age

F_{MEO}

4.688

3.002

7.966

3.756

4.042

3.765

4.425

12.333

8.923

4.995

4.042

1.948

F_{DP}

19.14

19.394

18.86

20.025

18.688

19.789

20.438

14.893

21.909

20.674

17.134

F_{OAE}

3.534

1.426

4.149

2.78

3.775

3.392

2.765

10.627

4.861

2.715

2.271

0.865

F_{EO}

11.998

6.893

14.031

9.808

11.945

12.703

12.283

38.971

10.877

11.605

9.98

3.288

Intersection

F_{MEO}

9.375

8.995

20.066

5.556

12.5

22.231

12.5

4.412

9.375

F_{DP}

37.067

39.108

27.455

36.047

37.067

37.352

38.372

23.106

27.997

36.047

39.393

35.026

F_{OAE}

5.769

5.454

18.902

4.808

22.672

17.864

4.137

3.54

4.082

F_{EO}

24.367

23.905

43.338

18.623

24.071

33.455

25.185

150.941

53.279

30.359

16.799

27.109

ACC

98.547

97.465

94.189

98.671

98.578

98.423

98.702

73.447

94.405

97.682

98.887

99.073

AUC

99.736

99.581

99.501

99.756

99.644

99.387

99.856

80.642

97.922

99.704

99.781

99.923

99.423

98.911

99.063

99.497

99.079

98.07

99.725

67.995

95.558

99.432

99.657

99.867

EER

1.525

1.672

2.951

1.279

1.426

1.574

1.328

26.365

6.05

2.361

1.279

1.082

Table 33: Detailed fairness and utility evaluation results on Palette.

| Dataset | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| SD1.5 | Gender | $F_{MEO}$ | 0.509 | 3.139 | 0.317 | 1.139 | 1.186 | 0.248 | 1.674 | 2.635 | 0.891 | 0.457 | 1.275 | 0.345 |
| | | $F_{DP}$ | 22.081 | 19.548 | 20.978 | 22.834 | 22.776 | 22.724 | 22.857 | 11.878 | 20.185 | 20.993 | 23.084 | 22.934 |
| | | $F_{OAE}$ | 0.495 | 1.293 | 1.576 | 0.178 | 0.153 | 0.021 | 0.332 | 4.404 | 2.235 | 1.599 | 0.07 | 0.119 |
| | | $F_{EO}$ | 0.613 | 3.185 | 0.554 | 1.705 | 1.56 | 0.359 | 1.886 | 3.738 | 1.065 | 0.625 | 1.661 | 0.399 |
| | Race | $F_{MEO}$ | 1.968 | 7.1 | 8.289 | 2.12 | 1.5 | 1.342 | 1.959 | 19.503 | 18.674 | 2.837 | 1.469 | 1.323 |
| | | $F_{DP}$ | 15.903 | 16.759 | 15.365 | 15.889 | 15.045 | 14.817 | 15.624 | 11.712 | 14.571 | 14.654 | 15.689 | 14.94 |
| | | $F_{OAE}$ | 0.943 | 2.918 | 4.108 | 1.294 | 0.879 | 0.297 | 1.454 | 10.28 | 9.562 | 1.645 | 1.347 | 0.901 |
| | | $F_{EO}$ | 4.614 | 15.815 | 13.016 | 4.227 | 2.198 | 2.227 | 3.486 | 40.819 | 27.388 | 6.944 | 3.342 | 3.124 |
| | Age | $F_{MEO}$ | 2.87 | 3.457 | 5.482 | 1.658 | 3.43 | 1.295 | 3.244 | 5.749 | 11.164 | 4.406 | 2.534 | 1.061 |
| | | $F_{DP}$ | 31.026 | 28.27 | 31.043 | 30.768 | 32.076 | 31.734 | 31.627 | 16.203 | 30.92 | 31.262 | 31.78 | 32.059 |
| | | $F_{OAE}$ | 2.054 | 2.83 | 1.832 | 1.244 | 2.275 | 0.828 | 2.1 | 11.13 | 2.965 | 2.806 | 1.818 | 0.942 |
| | | $F_{EO}$ | 6.24 | 9.127 | 9.167 | 3.913 | 6.148 | 2.276 | 5.276 | 12.176 | 15.841 | 7.182 | 5.215 | 2.787 |
| | Intersection | $F_{MEO}$ | 3.333 | 11.68 | 11.018 | 3.333 | 6.667 | 2.439 | 6.206 | 24.497 | 23.778 | 3.283 | 6.206 | 1.695 |
| | | $F_{DP}$ | 34.936 | 32.557 | 32.823 | 35.564 | 34.066 | 33.963 | 34.985 | 24.536 | 30.941 | 32.333 | 35.227 | 34.35 |
| | | $F_{OAE}$ | 1.928 | 5.084 | 4.661 | 1.915 | 3.382 | 1.667 | 2.27 | 14.398 | 11.932 | 3.428 | 2.27 | 1.208 |
| | | $F_{EO}$ | 10.679 | 35.812 | 28.777 | 12.75 | 18.354 | 10.377 | 15.07 | 88.474 | 69.692 | 15.9 | 13.668 | 8.346 |
| | | ACC | 97.272 | 95.847 | 95.862 | 97.833 | 97.848 | 99.045 | 98.151 | 73.219 | 94.983 | 95.696 | 98.424 | 99.47 |
| | | AUC | 99.792 | 98.953 | 99.499 | 99.803 | 99.826 | 99.766 | 99.877 | 86.563 | 97.922 | 99.63 | 99.893 | 99.963 |
| | | | 99.832 | 98.661 | 99.538 | 99.828 | 99.862 | 99.716 | 99.914 | 85.449 | 97.861 | 99.675 | 99.928 | 99.969 |
| | | EER | 1.887 | 4.073 | 2.947 | 1.755 | 1.523 | 0.993 | 1.192 | 21.159 | 6.424 | 2.682 | 0.861 | 0.53 |

Table 34: Detailed fairness and utility evaluation results on SD v1.5.

| Dataset | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| SD Inpainting | Gender | $F_{MEO}$ | 2.241 | 6.934 | 2.814 | 2.686 | 1.636 | 2.432 | 1.652 | 4.449 | 1.288 | 2.495 | 1.455 | 0.849 |
| | | $F_{DP}$ | 20.739 | 14.685 | 21.614 | 23.834 | 21.747 | 23.587 | 23.36 | 12.632 | 19.574 | 19.711 | 22.764 | 22.798 |
| | | $F_{OAE}$ | 2.701 | 7.184 | 0.172 | 1.041 | 2.434 | 1.154 | 0.901 | 6.926 | 1.511 | 3.887 | 1.367 | 0.746 |
| | | $F_{EO}$ | 2.704 | 6.971 | 3.64 | 3.722 | 3.096 | 2.449 | 2.302 | 5.737 | 2.278 | 2.932 | 1.837 | 1.048 |
| | Race | $F_{MEO}$ | 2.353 | 8.159 | 13.709 | 4.767 | 2.884 | 3.957 | 4.424 | 15.135 | 23.357 | 4.628 | 4.424 | 1.599 |
| | | $F_{DP}$ | 11.33 | 9.279 | 10.32 | 11.773 | 10.267 | 12.654 | 11.567 | 6.414 | 10.94 | 12.332 | 13.285 | 12.717 |
| | | $F_{OAE}$ | 2.569 | 6.514 | 5.566 | 1.598 | 2.857 | 1.908 | 2.863 | 7.86 | 9.299 | 2.983 | 2.904 | 1.378 |
| | | $F_{EO}$ | 6.652 | 21.735 | 26.462 | 7.096 | 7.51 | 8.601 | 8.099 | 34.693 | 34.503 | 9.538 | 11.207 | 3.61 |
| | Age | $F_{MEO}$ | 6.106 | 10.945 | 8.907 | 8.131 | 6.14 | 6.319 | 6.518 | 5.833 | 5.738 | 6.494 | 6.329 | 1.77 |
| | | $F_{DP}$ | 34.829 | 24.512 | 35.588 | 35.072 | 36.422 | 36.373 | 35.589 | 22.872 | 33.578 | 35.295 | 35.397 | 36.172 |
| | | $F_{OAE}$ | 5.83 | 11.618 | 4.664 | 3.031 | 4.865 | 2.925 | 3.538 | 10.413 | 2.475 | 6.172 | 3.678 | 1.276 |
| | | $F_{EO}$ | 11.823 | 19.295 | 20.31 | 11.192 | 12.313 | 8.877 | 10.26 | 16.169 | 10.115 | 11.654 | 10.325 | 4.012 |
| | Intersection | $F_{MEO}$ | 6.237 | 19.863 | 19.037 | 8.725 | 5.369 | 6.711 | 7.383 | 19.06 | 31.294 | 12.213 | 7.383 | 3.693 |
| | | $F_{DP}$ | 27.04 | 21.323 | 28.54 | 30.477 | 27.278 | 30.434 | 29.669 | 17.974 | 28.629 | 25.716 | 30.228 | 30.269 |
| | | $F_{OAE}$ | 4.884 | 13.648 | 6.769 | 4.329 | 5.395 | 3.493 | 4.397 | 14.884 | 12.987 | 9.387 | 6.247 | 3.194 |
| | | $F_{EO}$ | 18.736 | 61.309 | 52.861 | 25.231 | 20.741 | 27.99 | 21.834 | 69.995 | 61.174 | 28.951 | 22.144 | 11.665 |
| | | ACC | 95.133 | 86.754 | 94.333 | 96.517 | 95.475 | 97.445 | 96.86 | 78.49 | 94.105 | 92.849 | 96.846 | 98.715 |
| | | AUC | 99.552 | 97.281 | 98.31 | 99.525 | 99.547 | 99.659 | 99.687 | 89.51 | 97.403 | 99.386 | 99.707 | 99.912 |
| | | | 99.679 | 97.138 | 98.529 | 99.652 | 99.677 | 99.727 | 99.766 | 91.313 | 97.767 | 99.564 | 99.79 | 99.939 |
| | | EER | 3.434 | 7.631 | 6.729 | 3.226 | 3.33 | 2.463 | 2.428 | 17.933 | 7.527 | 3.954 | 2.393 | 1.283 |

Table 35: Detailed fairness and utility evaluation results on SD Inpainting.

B.5 Details of Post-Processing

In Section 4, we applied six post-processing methods to evaluate the detectors’ robustness. Fig. B.1 visualizes an image after each post-processing method is applied. We describe each method as follows:

JPEG Compression: Image compression introduces compression artifacts and reduces the image quality, simulating real-world scenarios where images may be of lower quality or have compression artifacts. In Fig. 6 we apply image compression with quality 60 to each image in the test set.

Gaussian Blur: This post-processing reduces image detail and noise by smoothing the image, averaging pixel values with a Gaussian kernel. In Fig. 6 we apply Gaussian blur with kernel size 7 to each image in the test set.

Hue Saturation Value: Alters the hue, saturation, and value of the image within specified limits. This post-processing technique is used to simulate variations in color and lighting conditions. Adjusting the hue changes the overall color tone, saturation controls the intensity of colors, and value adjusts the brightness. The results in Fig. 6 are after we adjust hue, saturation, and value with shifting limits 30.

Random Brightness and Contrast: This post-processing method adjusts the brightness and contrast of the image within specified limits. Applying random brightness and contrast variations changes the illumination and contrast levels of the images, which evaluates the detector’s robustness to different illumination conditions. The results in Fig. 6 are after we adjust brightness and contrast with shifting limits 0.2.

Random Crop: Resizes the image to a specified size and then randomly crops a portion of it to the target dimensions. This post-processing method is used to evaluate the detector’s robustness to variations in the spatial content of the image. The results in Fig. 6 are after we randomly crop the image with target dimensions of $244\times 244$.

Rotation: Rotates the image within a specified angle limit. This post-processing method is used to evaluate the detector’s robustness to changes in the orientation of objects within the image. The results in Fig. 6 are after we randomly rotate the image within a range of -45 to 45 degrees.
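
As a rough illustration of two of the operations described above, the following pure-Python sketch builds a Gaussian kernel of size 7 and applies a random brightness and contrast shift with limit 0.2. It is not the benchmark's implementation: the function names, the OpenCV-style default sigma, the edge-replication padding, and the scaling of the brightness offset to the 0-255 range are all assumptions made for this example.

```python
import math
import random


def gaussian_kernel(size=7, sigma=None):
    """Return a normalized 1-D Gaussian kernel of odd length `size`."""
    if size % 2 == 0:
        raise ValueError("kernel size must be odd")
    if sigma is None:
        # Assumption: derive sigma from the kernel size, OpenCV-style.
        sigma = 0.3 * ((size - 1) * 0.5 - 1) + 0.8
    half = size // 2
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, kernel):
    """Blur one row of pixel values, replicating edge pixels as padding."""
    half = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - half, 0), len(row) - 1)
            acc += w * row[j]
        out.append(acc)
    return out


def blur_image(image, kernel):
    """Separable blur: filter every row, then every column."""
    rows = [blur_row(r, kernel) for r in image]
    cols = [blur_row(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def random_brightness_contrast(image, limit=0.2, rng=random):
    """Apply a random contrast gain and brightness offset within +/- limit."""
    alpha = 1.0 + rng.uniform(-limit, limit)    # contrast gain
    beta = 255.0 * rng.uniform(-limit, limit)   # brightness offset (assumed scaling)
    return [[min(255.0, max(0.0, alpha * p + beta)) for p in row]
            for row in image]
```

Here `image` is a grayscale image stored as a list of rows of pixel values; a colour image would apply the same steps per channel.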

B.6 Additional Fairness Robustness Evaluation Results

Fig. B.2 to Fig. B.6 show the detectors’ robustness in more detail as a function of the degree of post-processing. Overall, ViT-B/16 [63] and UnivFD [67] are more robust to the various post-processing methods than the other detectors. Fairness-enhanced detectors are not robust to post-processing, which is a direction for future work. Fig. B.2 presents a detailed robustness analysis in terms of utility and fairness under varying degrees of JPEG compression. The utility of all detectors decreases as image quality is reduced. Among the detectors, UnivFD [67] exhibits the highest utility robustness, while ViT-B/16 [63] demonstrates the strongest fairness robustness. Under Gaussian blur, ViT-B/16 is the most robust detector in terms of utility, whereas EfficientB4 [62] is the most robust in terms of fairness. Under Hue Saturation Value adjustments, DAW-FDD [20] shows the strongest utility robustness, while UnivFD excels in fairness robustness. ViT-B/16 is the most robust in both utility and fairness under rotation. For brightness and contrast variations, DAG-FDD [20] is the most robust detector in terms of utility, while UnivFD again shows the strongest fairness robustness.

B.7 Additional Fairness Generalization Evaluation Results

We conduct additional generalization experiments by using models trained on FF++ [2] and evaluating them on our AI-Face test set. For these experiments, we use the trained weights and intra-domain performance metrics provided by [16]. Consequently, only the detectors with pre-trained weights available from [16] are evaluated on our AI-Face test set. Results are shown in Table 6. We report the detailed performance on the generation category subsets (i.e., Deepfake Videos, GANs, and DMs) and the overall performance on the whole test set. We observe that detectors exhibit significant performance degradation, approaching coin-toss performance, when trained on FF++ and tested on our AI-Face test set. This suggests that training detectors solely on one deepfake video dataset is not sufficient for detecting face images generated by more advanced current generation models. It also highlights the significance of our AI-Face dataset, which is extensive, diverse, and comprehensive in generation methods, for developing and evaluating AI face detectors. The lowest performance is observed on GANs, likely due to the higher variety of generation methods within this category. Conversely, performance on the Deepfake Videos subset is relatively better. This could be because, despite coming from different datasets, the deepfake videos may share similar generation methods, resulting in less variation in the artifacts present in the generated images.
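
To make the "coin-toss" reference point concrete, the sketch below computes AUC with the standard pairwise (Mann-Whitney) formulation: the fraction of fake/real pairs in which the fake image receives the higher score, with ties counted as one half. A detector whose scores carry no information lands at 0.5 (50 on the percentage scale used in the tables). The function name and the convention that label 1 means fake are assumptions for illustration; this is not the benchmark's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random fake outscores a random real.

    labels: 1 for fake, 0 for real (assumed convention).
    scores: detector outputs, higher meaning "more likely fake".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

The quadratic loop is meant for small illustrative inputs; a rank-based implementation would be preferable at the scale of the full test set.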

| Type | Detector | Intra-Domain (FF++): AUC | Cross-Domain Subset, Deepfake Videos (3): $F_{EO}$ | Cross-Domain Subset, Deepfake Videos (3): AUC | Cross-Domain Subset, GANs (10): $F_{EO}$ | Cross-Domain Subset, GANs (10): AUC | Cross-Domain Subset, DMs (8): $F_{EO}$ | Cross-Domain Subset, DMs (8): AUC | Cross-Domain Whole Test Set: $F_{EO}$ | Cross-Domain Whole Test Set: AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive | Xception [61] | 96.370 | 104.961 | 77.766 | 139.963 | 58.228 | 110.977 | 78.622 | 101.194 | 72.649 |
| | EfficientB4 [62] | 95.670 | 110.626 | 76.612 | 148.656 | 44.501 | 88.420 | 73.426 | 94.609 | 65.323 |
| Frequency | F3Net [64] | 96.350 | 74.828 | 74.328 | 93.278 | 39.127 | 89.927 | 75.480 | 68.299 | 65.149 |
| | SPSL [65] | 96.100 | 97.558 | 77.766 | 141.029 | 40.100 | 91.837 | 58.919 | 123.534 | 55.483 |
| | SRM [66] | 95.760 | 60.855 | 74.900 | 89.903 | 57.572 | 73.209 | 77.954 | 57.775 | 72.474 |
| Spatial | UCF [16] | 97.050 | 102.798 | 77.650 | 122.485 | 40.477 | 95.657 | 77.568 | 79.479 | 67.708 |
| | CORE [68] | 96.380 | 69.717 | 76.506 | 95.727 | 45.549 | 79.161 | 82.112 | 72.424 | 70.662 |

Table 36: Fairness and utility cross-domain evaluation. All detectors are trained on FF++ (model weights and AUC on FF++ test set are from [25]) and evaluated on our Demographically Annotated AI-Face. Cross-domain results are on ours w/o FF++. The best-performing method is highlighted in red.

B.8 Full Results of Effect of Increasing the Size of Train Set

In this section, we provide the full evaluation results obtained with different training-set sizes, as shown in Table 37 to Table 40. The Intersection $F_{EO}$ and AUC values align with the results in Fig. 7 of the submitted manuscript.

| Dataset Size | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| 20% | Gender | $F_{MEO}$ | 1.725 | 0.366 | 0.863 | 1.523 | 0.916 | 1.818 | 0.652 | 0.369 | 2.196 | 1.657 | 1.549 | 1.428 |
| | | $F_{DP}$ | 1.944 | 2.239 | 2.618 | 2.083 | 2.305 | 1.586 | 2.811 | 2.317 | 1.543 | 1.823 | 2.269 | 2.106 |
| | | $F_{OAE}$ | 1.076 | 0.419 | 0.906 | 1.057 | 0.775 | 0.800 | 0.617 | 0.620 | 1.076 | 0.950 | 1.145 | 0.904 |
| | | $F_{EO}$ | 2.030 | 0.386 | 1.635 | 1.945 | 1.280 | 2.081 | 0.768 | 0.629 | 2.214 | 1.738 | 2.244 | 1.742 |
| | Race | $F_{MEO}$ | 14.155 | 11.039 | 10.108 | 13.887 | 12.235 | 15.231 | 11.756 | 14.625 | 16.804 | 16.116 | 12.021 | 12.645 |
| | | $F_{DP}$ | 23.488 | 20.018 | 22.360 | 23.285 | 22.782 | 22.998 | 22.994 | 22.628 | 25.752 | 23.457 | 23.093 | 22.572 |
| | | $F_{OAE}$ | 5.266 | 5.286 | 5.057 | 5.416 | 4.807 | 5.425 | 5.063 | 6.459 | 5.913 | 5.009 | 4.877 | 4.676 |
| | | $F_{EO}$ | 24.015 | 19.947 | 25.662 | 25.293 | 22.940 | 23.207 | 24.837 | 28.765 | 29.623 | 22.162 | 22.625 | 21.318 |
| | Age | $F_{MEO}$ | 6.766 | 5.613 | 5.335 | 7.254 | 5.765 | 6.506 | 8.761 | 5.411 | 7.208 | 5.948 | 5.672 | 5.769 |
| | | $F_{DP}$ | 5.086 | 5.581 | 6.666 | 5.089 | 5.561 | 4.659 | 6.170 | 6.073 | 4.556 | 5.080 | 5.291 | 5.337 |
| | | $F_{OAE}$ | 3.784 | 3.177 | 4.958 | 3.745 | 3.493 | 3.435 | 4.491 | 4.692 | 4.209 | 4.183 | 3.159 | 3.242 |
| | | $F_{EO}$ | 9.533 | 9.157 | 12.476 | 9.632 | 9.222 | 9.203 | 11.928 | 14.228 | 10.548 | 9.699 | 8.470 | 8.339 |
| | Intersection | $F_{MEO}$ | 17.912 | 12.056 | 14.781 | 17.613 | 14.966 | 19.221 | 14.360 | 17.533 | 20.977 | 19.466 | 15.288 | 15.734 |
| | | $F_{DP}$ | 25.299 | 22.237 | 23.053 | 25.005 | 23.895 | 25.807 | 23.863 | 23.563 | 27.720 | 25.542 | 24.273 | 24.374 |
| | | $F_{OAE}$ | 8.001 | 9.313 | 8.647 | 7.898 | 7.506 | 6.137 | 8.856 | 11.806 | 8.713 | 5.859 | 7.538 | 5.378 |
| | | $F_{EO}$ | 54.208 | 45.790 | 54.752 | 56.299 | 50.295 | 52.526 | 55.119 | 66.894 | 63.986 | 44.272 | 49.137 | 45.127 |
| | | ACC | 95.175 | 94.292 | 93.972 | 94.913 | 95.084 | 95.534 | 95.249 | 90.810 | 94.835 | 94.996 | 95.243 | 95.602 |
| | | AUC | 98.620 | 99.055 | 98.765 | 98.284 | 98.851 | 98.026 | 98.728 | 96.404 | 98.237 | 98.403 | 98.731 | 98.533 |
| | | | 98.805 | 99.325 | 99.132 | 98.441 | 99.083 | 98.353 | 98.931 | 97.227 | 98.410 | 98.578 | 98.980 | 98.695 |
| | | EER | 5.563 | 5.208 | 6.267 | 6.142 | 5.292 | 5.489 | 4.933 | 10.001 | 6.169 | 5.696 | 5.424 | 5.148 |

Table 37: Detailed fairness and utility evaluation results on 20% training subset.

| Dataset Size | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| 40% | Gender | $F_{MEO}$ | 1.771 | 2.562 | 1.588 | 1.383 | 1.277 | 0.567 | 1.191 | 0.752 | 1.465 | 1.034 | 1.303 | 1.41 |
| | | $F_{DP}$ | 1.801 | 3.841 | 1.948 | 2.33 | 2.088 | 1.955 | 2.715 | 2.756 | 2.113 | 2.362 | 2.128 | 2.998 |
| | | $F_{OAE}$ | 0.908 | 0.439 | 0.881 | 1.117 | 0.661 | 0.078 | 1.281 | 0.625 | 0.971 | 0.811 | 0.793 | 1.193 |
| | | $F_{EO}$ | 1.799 | 3.236 | 1.809 | 2.034 | 1.415 | 1.095 | 2.286 | 1.023 | 1.796 | 1.474 | 1.51 | 2.263 |
| | Race | $F_{MEO}$ | 14.7 | 12.688 | 9.731 | 14.333 | 10.203 | 7.959 | 14.511 | 14.169 | 14.04 | 11.7 | 12.504 | 7.79 |
| | | $F_{DP}$ | 23.675 | 21.948 | 22.57 | 23.424 | 22.165 | 21.403 | 25.546 | 23.264 | 22.811 | 21.994 | 22.856 | 21.024 |
| | | $F_{OAE}$ | 5.079 | 6.318 | 4.774 | 4.986 | 4.49 | 3.571 | 5.708 | 6.282 | 5.222 | 3.819 | 4.448 | 4.043 |
| | | $F_{EO}$ | 23.443 | 17.727 | 22.852 | 22.917 | 20.787 | 17.734 | 30.682 | 30.21 | 22.288 | 17.342 | 20.633 | 18.703 |
| | Age | $F_{MEO}$ | 7.594 | 3.343 | 4.055 | 6.859 | 5.051 | 4.126 | 6.145 | 5.676 | 6.85 | 5.874 | 6.46 | 3.48 |
| | | $F_{DP}$ | 4.951 | 6.723 | 5.222 | 5.421 | 5.485 | 4.937 | 5.55 | 6.471 | 5.272 | 5.672 | 5.447 | 6.276 |
| | | $F_{OAE}$ | 3.873 | 1.951 | 2.928 | 3.709 | 3.057 | 2.589 | 3.747 | 4.713 | 3.447 | 3.655 | 3.689 | 3.158 |
| | | $F_{EO}$ | 9.596 | 9.58 | 8.258 | 8.736 | 8.461 | 8.457 | 9.256 | 14.995 | 8.374 | 9.236 | 9.119 | 8.222 |
| | Intersection | $F_{MEO}$ | 18.307 | 20.275 | 14.911 | 17.454 | 12.641 | 12.131 | 19.386 | 17.922 | 17.346 | 13.83 | 15.213 | 11.211 |
| | | $F_{DP}$ | 25.685 | 24.437 | 23.109 | 24.801 | 23.725 | 22.091 | 26.683 | 23.444 | 24.706 | 23.145 | 24.355 | 21.662 |
| | | $F_{OAE}$ | 5.814 | 10.624 | 8.96 | 5.964 | 5.707 | 6.691 | 10.527 | 11.615 | 5.902 | 4.63 | 5.04 | 6.579 |
| | | $F_{EO}$ | 48.63 | 41.478 | 49.564 | 47.936 | 44.562 | 37.57 | 63.173 | 66.301 | 49.171 | 34.392 | 43.13 | 40.847 |
| | | ACC | 95.796 | 94.03 | 94.822 | 95.844 | 95.794 | 95.393 | 94.754 | 90.711 | 95.984 | 95.975 | 96.257 | 95.337 |
| | | AUC | 98.696 | 98.932 | 99.024 | 98.851 | 99.064 | 98.722 | 98.306 | 96.371 | 98.824 | 98.974 | 98.949 | 99.092 |
| | | | 98.778 | 99.269 | 99.31 | 98.959 | 99.236 | 98.968 | 98.588 | 97.224 | 98.984 | 99.139 | 99.035 | 99.318 |
| | | EER | 5.027 | 5.442 | 5.474 | 5.002 | 4.574 | 4.77 | 6.037 | 10.044 | 5.009 | 4.567 | 4.285 | 4.729 |

Table 38: Detailed fairness and utility evaluation results on 40% training subset.

| Dataset Size | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| 60% | Gender | $F_{MEO}$ | 0.801 | 0.489 | 2.239 | 1.576 | 1.787 | 1.179 | 0.745 | 0.539 | 0.907 | 1.512 | 1.527 | 0.408 |
| | | $F_{DP}$ | 2.899 | 2.747 | 4.223 | 1.996 | 1.638 | 3.062 | 2.596 | 2.509 | 2.291 | 3.01 | 1.88 | 2.857 |
| | | $F_{OAE}$ | 0.692 | 0.109 | 0.935 | 0.904 | 0.802 | 0.361 | 0.783 | 0.634 | 0.697 | 1.296 | 0.716 | 0.27 |
| | | $F_{EO}$ | 1.136 | 0.677 | 3.474 | 1.757 | 2.002 | 1.22 | 1.328 | 0.586 | 1.153 | 2.435 | 1.594 | 0.547 |
| | Race | $F_{MEO}$ | 8.652 | 16.885 | 6.328 | 13.433 | 16.19 | 6.243 | 9.96 | 14.482 | 14.243 | 9.849 | 14.223 | 5.96 |
| | | $F_{DP}$ | 21.794 | 26.205 | 14.609 | 23.519 | 21.498 | 18.874 | 20.947 | 23.469 | 24.031 | 20.453 | 23.547 | 15.86 |
| | | $F_{OAE}$ | 3.781 | 6.63 | 5.65 | 4.671 | 5.716 | 4.346 | 4.133 | 6.328 | 4.569 | 3.77 | 5.247 | 3.746 |
| | | $F_{EO}$ | 18.942 | 23.67 | 21.707 | 22.128 | 23.478 | 12.107 | 15.885 | 29.99 | 22.562 | 14.96 | 22.213 | 14.735 |
| | Age | $F_{MEO}$ | 5.047 | 5.153 | 5.719 | 6.155 | 4.154 | 3.81 | 4.71 | 5.243 | 5.789 | 5.512 | 3.699 | 3.553 |
| | | $F_{DP}$ | 6.012 | 4.411 | 7.664 | 5.157 | 5.456 | 6.02 | 6.042 | 6.245 | 5.444 | 7.926 | 5.023 | 6.488 |
| | | $F_{OAE}$ | 2.916 | 2.496 | 4.283 | 3.316 | 3.897 | 2.374 | 2.752 | 4.555 | 3.422 | 3.886 | 2.858 | 2.244 |
| | | $F_{EO}$ | 7.607 | 10.635 | 13.662 | 8.084 | 9.09 | 7.503 | 6.872 | 14.282 | 8.089 | 8.321 | 6.951 | 8.124 |
| | Intersection | $F_{MEO}$ | 10.466 | 25.134 | 7.982 | 16.425 | 17.532 | 10.693 | 12.44 | 17.613 | 16.374 | 12.417 | 17.272 | 9.574 |
| | | $F_{DP}$ | 22.891 | 27.88 | 18.106 | 25.118 | 24.338 | 20.236 | 22.678 | 23.819 | 25.063 | 22.277 | 25.459 | 18.176 |
| | | $F_{OAE}$ | 5.884 | 11.229 | 7.443 | 5.547 | 6.822 | 7.714 | 4.749 | 11.612 | 5.287 | 5.726 | 5.873 | 5.899 |
| | | $F_{EO}$ | 39.873 | 51.509 | 46.548 | 46.511 | 52.673 | 28.055 | 35.888 | 66.884 | 45.261 | 30.682 | 47.626 | 31.167 |
| | | ACC | 96.505 | 93.931 | 93.612 | 96.221 | 95.676 | 96.51 | 96.567 | 90.882 | 96.009 | 95.025 | 96.332 | 96.488 |
| | | AUC | 98.97 | 98.828 | 98.536 | 99.075 | 99.102 | 99.236 | 99.026 | 96.461 | 99.189 | 99.003 | 99.354 | 99.401 |
| | | | 98.987 | 99.195 | 98.953 | 99.17 | 99.234 | 99.415 | 99.012 | 97.279 | 99.351 | 99.285 | 99.503 | 99.461 |
| | | EER | 3.829 | 6.004 | 6.668 | 4.322 | 4.314 | 3.248 | 3.592 | 9.875 | 4.351 | 5.072 | 3.882 | 3.583 |

Table 39: Detailed fairness and utility evaluation results on 60% training subset.

| Dataset Size | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | Model Type | Naive | Naive | Naive | Frequency | Frequency | Frequency | Spatial | Spatial | Spatial | Fairness-enhanced | Fairness-enhanced | Fairness-enhanced |
| 80% | Gender | $F_{MEO}$ | 1.753 | 0.256 | 1.697 | 0.976 | 1.199 | 0.166 | 1.235 | 0.447 | 1.428 | 0.398 | 0.526 | 0.339 |
| | | $F_{DP}$ | 1.925 | 2.648 | 1.891 | 2.861 | 3.002 | 2.642 | 3.511 | 2.461 | 1.881 | 2.762 | 2.695 | 2.643 |
| | | $F_{OAE}$ | 0.943 | 0.002 | 0.988 | 0.280 | 0.474 | 0.316 | 0.737 | 0.596 | 0.665 | 0.214 | 0.408 | 0.172 |
| | | $F_{EO}$ | 1.893 | 0.364 | 1.910 | 1.214 | 1.489 | 0.218 | 1.237 | 0.467 | 1.495 | 0.522 | 0.788 | 0.384 |
| | Race | $F_{MEO}$ | 11.908 | 11.806 | 9.589 | 4.724 | 3.751 | 8.864 | 2.988 | 14.911 | 13.396 | 3.892 | 5.036 | 2.891 |
| | | $F_{DP}$ | 22.332 | 21.476 | 19.620 | 18.520 | 17.354 | 18.431 | 16.783 | 23.411 | 22.809 | 16.666 | 18.631 | 17.598 |
| | | $F_{OAE}$ | 4.298 | 6.127 | 4.724 | 3.890 | 3.573 | 4.030 | 2.667 | 6.322 | 4.970 | 3.282 | 4.350 | 2.731 |
| | | $F_{EO}$ | 18.793 | 19.118 | 20.637 | 10.997 | 9.458 | 14.889 | 10.966 | 29.610 | 22.226 | 9.621 | 13.148 | 8.090 |
| | Age | $F_{MEO}$ | 5.554 | 4.823 | 4.219 | 2.168 | 3.355 | 2.588 | 1.699 | 5.731 | 6.528 | 2.822 | 1.498 | 1.076 |
| | | $F_{DP}$ | 5.307 | 5.675 | 5.586 | 6.111 | 6.492 | 6.159 | 6.884 | 6.433 | 4.840 | 6.781 | 5.943 | 5.842 |
| | | $F_{OAE}$ | 3.001 | 2.397 | 3.365 | 1.221 | 1.905 | 1.389 | 0.832 | 4.916 | 3.150 | 2.718 | 1.133 | 0.744 |
| | | $F_{EO}$ | 7.274 | 8.476 | 8.710 | 4.026 | 8.252 | 5.533 | 5.139 | 15.746 | 8.380 | 7.114 | 4.174 | 2.835 |
| | Intersection | $F_{MEO}$ | 14.979 | 17.336 | 11.294 | 6.650 | 6.863 | 9.372 | 5.369 | 18.159 | 16.769 | 5.729 | 8.210 | 5.443 |
| | | $F_{DP}$ | 24.220 | 21.943 | 21.145 | 21.258 | 20.254 | 20.920 | 19.077 | 24.033 | 24.954 | 18.015 | 20.556 | 19.798 |
| | | $F_{OAE}$ | 5.025 | 10.608 | 7.908 | 6.697 | 5.709 | 6.343 | 5.118 | 11.541 | 5.558 | 5.760 | 7.955 | 4.583 |
| | | $F_{EO}$ | 40.744 | 44.028 | 45.684 | 27.249 | 22.360 | 32.012 | 24.504 | 66.750 | 46.492 | 21.906 | 29.401 | 17.687 |
| | | ACC | 96.629 | 94.917 | 94.904 | 95.309 | 96.461 | 96.548 | 97.736 | 90.898 | 95.586 | 95.808 | 97.317 | 98.277 |
| | | AUC | 99.361 | 98.788 | 99.143 | 99.409 | 99.597 | 99.682 | 99.753 | 96.501 | 98.440 | 99.419 | 99.739 | 99.860 |
| | | | 99.429 | 99.051 | 99.403 | 99.523 | 99.653 | 99.765 | 99.801 | 97.308 | 98.562 | 99.589 | 99.817 | 99.874 |
| | | EER | 3.538 | 5.198 | 5.189 | 3.894 | 3.138 | 2.707 | 2.276 | 9.817 | 5.470 | 4.259 | 2.745 | 1.738 |

Table 40: Detailed fairness and utility evaluation results on 80% training subset.

B.9 Fairness and Utility Trade-off

Fig. B.7 presents the trade-offs between $F_{DP}$ on age and AUC for three fairness-enhanced methods, showing how well each method balances utility and fairness. 1) PG-FDD [21] achieves the best utility-fairness trade-off overall: it improves fairness without sacrificing utility. For instance, PG-FDD achieves a higher AUC than DAW-FDD and DAG-FDD while maintaining comparable fairness metrics. 2) DAW-FDD [20] is sensitive to the hyperparameter that balances utility and fairness. For example, when its fairness metric approaches zero, its utility also drops to coin-toss performance. This sensitivity can hinder practical deployment, as extensive tuning is required to optimize performance. 3) To ensure broader applicability and reliability, future fairness approaches should aim to minimize sensitivity to hyperparameter settings.
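
For readers unfamiliar with demographic-parity style metrics, the sketch below computes one common formulation: the largest gap in the rate of "fake" predictions between any two demographic groups. This is a generic textbook version written for illustration; the paper's exact $F_{DP}$ definition is not given in this section and may differ. The function names, the group labels, and the convention that prediction 1 means fake are assumptions.

```python
from collections import defaultdict


def positive_rates(preds, groups):
    """Fraction of samples predicted as fake (1) within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [num_predicted_fake, num_total]
    for p, g in zip(preds, groups):
        counts[g][0] += int(p == 1)
        counts[g][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def demographic_parity_gap(preds, groups):
    """Largest difference in predicted-fake rate between any two groups."""
    rates = positive_rates(preds, groups)
    return max(rates.values()) - min(rates.values())
```

A gap of zero means every group is flagged as fake at the same rate. A detector can reach zero trivially by predicting the same label for everyone, which is consistent with the observation above that pushing the fairness metric toward zero can coincide with coin-toss utility.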

Appendix C Datasheet for AI-Face

In this section, we present a DataSheet [87] for AI-Face.

C.1 Motivation For Dataset Creation

• Why is the dataset created? For researchers to evaluate the fairness of AI face detection models or to train fairer models. Please see Section 2 ‘Background and Motivation’ in the submitted manuscript.
• Has the dataset been used already? Yes. Our fairness benchmark is based on this dataset.
• What (other) tasks could the dataset be used for? It could be used as training data for generative-method attribution tasks.
• Who funded dataset creation? This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2348419 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. Please see the Acknowledgment.

C.2 Data Composition

• What are the instances? The instances that we consider in this work are real face images and AI-generated face images from public datasets.
• How many instances are there? We include more than 2 million face images from public datasets. Please see Table 13 for details.
• What data does each instance consist of? Each instance consists of an image.
• Is there a label or target associated with each instance? Each image is associated with an uncertainty score for gender prediction, an uncertainty score for age prediction, an uncertainty score for race prediction, a gender annotation, an age annotation, a race annotation, and a target label (fake or real).
• Is any information missing from individual instances? No.
• Are relationships between individual instances made explicit? Not applicable – we do not study the relationship between each image.
• Does the dataset contain all possible instances or is it a sample? It contains all instances our curation pipeline collected. Since the current dataset does not cover all available images online, there is a high probability that more instances can be collected in the future.
• Are there recommended data splits (e.g., training, development/validation, testing)? For detector development and training, the dataset can be split as 6:2:2.
• Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. Yes. Despite our extensive efforts to reduce demographic label noise, including human corrections based on uncertainty scores, there may still be mislabeled instances. Given the dataset’s size of over 2 million images, it is impractical for humans to manually check and correct each image individually.
• Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? The dataset is self-contained.
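
The 6:2:2 split recommended above can be sketched as follows. This is only an illustration of the ratio, assuming a simple seeded random shuffle over a list of image identifiers; the function name and the seed are hypothetical, and the split files actually used for the benchmark may have been produced differently (for example, with stratification).

```python
import random


def split_622(items, seed=0):
    """Shuffle items reproducibly and split them 60% / 20% / 20%."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 6 // 10   # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

Fixing the seed keeps the three subsets identical across runs, so results obtained on the validation and test portions stay comparable.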

C.3 Collection Process

• What mechanisms or procedures were used to collect the data? We build our AI-Face dataset by collecting and integrating public AI-generated face images sourced from academic publications, GitHub repositories, and commercial tools. Please see ‘Data Collection’ in Section 3.2.
• How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data? The data can be acquired after we verify a user-submitted, signed EULA.
• If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Not applicable. We did not sample data from a larger set. However, we use RetinaFace [60] to detect and crop faces so that each image contains only one face.
• Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The data was collected from February 2024 to April 2024, although the data were originally released before this time. Please refer to the cited papers in Table 13 for the original release dates.

C.4 Data Processing

•

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? Yes. We discussed in ‘Demographically Annotation Generation’ in Section 3.2.
•

Was the ‘raw’ data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the ‘raw’ data. The ‘raw’ data can be acquired through the original data publisher. Please see the cited papers in Table 13.
•

Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point. Yes. We use RetinaFace [60] for detecting and cropping faces to ensure each image only contains one face. Demographic annotations are given by our annotator, see ‘Annotator Development’ in Section 3.1. Our annotator code is available on Our GitHub repository.
•

Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet? If not, what are the limitations? Yes. The dataset supports our goal, as it covers a comprehensive set of generation methods and provides demographic annotations for evaluating current detectors and training fairer ones.

C.5 Dataset Distribution

•

How will the dataset be distributed? We distribute all the data, together with CSV files containing the annotations for every image, under the CC BY-NC-ND 4.0 license and strictly for research purposes.
•

When will the dataset be released/first distributed? What license (if any) is it distributed under? The data have been released under the CC BY-NC-ND 4.0 license for research use only. Users can access our dataset by submitting an EULA. The dataset license and EULA are available on our GitHub: https://github.com/Purdue-M2/AI-Face-FairnessBench.
•

Are there any copyrights on the data? We believe our use is ‘fair use’ since all data in our dataset were collected from public datasets.
•

Are there any fees or access restrictions? No.
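
As a rough illustration of how the CSV annotation files mentioned in this subsection might be consumed, the sketch below counts images per demographic group and label using only the Python standard library. The column names (`image_path`, `label`, `gender`, `age`, `race`) and the sample rows are assumptions made for this example; the released files may use different headers and values.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the assumed format; not taken from the real dataset.
SAMPLE_CSV = """image_path,label,gender,age,race
img/0001.png,fake,female,young,asian
img/0002.png,real,male,adult,white
img/0003.png,fake,female,adult,black
"""


def count_by_group(csv_text, attribute):
    """Count images for each (attribute value, label) pair."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row[attribute], row["label"]) for row in reader)


# Tally real/fake images per gender group in the sample.
counts = count_by_group(SAMPLE_CSV, "gender")
for (group, label), n in sorted(counts.items()):
    print(f"{group:>8} {label:>5} {n}")
```

A per-group tally like this is a simple first check of how balanced a split is before training or evaluating a detector on it.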

C.6 Dataset Maintenance

•

Who is supporting/hosting/maintaining the dataset? The first author of this paper.
•

Will the dataset be updated? If so, how often and by whom? We do not plan to update it at this time.
•

Is there a repository to link to any/all papers/systems that use this dataset? Not right now, but we encourage anyone who uses the dataset to cite our paper so it can be easily found. Our fairness benchmark uses this dataset; the benchmark code is on our GitHub: https://github.com/Purdue-M2/AI-Face-FairnessBench.
•

If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Not at this time.

C.7 Legal and Ethical Considerations

•

Were any ethical review processes conducted (e.g., by an institutional review board)? No official review processes were conducted, since all data in our dataset were collected from existing public datasets.
•

Does the dataset contain data that might be considered confidential? No. We only use data from public datasets.
•

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. No. It is a face image dataset, and we have not seen any instance of offensive or abusive content.
•

Does the dataset relate to people? Yes. It is a face image dataset containing real face images and AI-generated face images.
•

Does the dataset identify any subpopulations (e.g., by age, gender)? Yes, through demographic annotations.
•

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? Yes. It is a face image dataset. Age, gender, and race can be identified from the face images as well as from the demographic annotations we provide. All of the images that we use are from publicly available data.

C.8 Author Statement and Confirmation of Data License

The authors of this work declare that the dataset described and provided has been collected, processed, and made available with full adherence to all applicable ethical guidelines and regulations. We accept full responsibility for any violations of rights or ethical guidelines that may arise from the use of this dataset. We also confirm that the dataset is released under the CC BY-NC-ND 4.0 license, permitting sharing and downloading of the work in any medium, provided the original author is credited, and it is used non-commercially with no derivative works created.