AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark
Abstract
AI-generated faces have enriched human life in areas such as entertainment, education, and art. However, they also pose misuse risks. Detecting AI-generated faces is therefore crucial, yet current detectors show biased performance across demographic groups. Biases can be mitigated with algorithmic fairness methods, which usually require demographically annotated face datasets for model training. However, no existing dataset comprehensively covers both demographic attributes and diverse generative methods, which hinders the development of fair detectors for AI-generated faces. In this work, we introduce the AI-Face dataset, the first million-scale demographically annotated AI-generated face image dataset, including real faces, faces from deepfake videos, and faces generated by Generative Adversarial Networks and Diffusion Models. Based on this dataset, we conduct the first comprehensive fairness benchmark to assess various AI face detectors and provide insights and findings to promote the fair design of future AI face detectors. Our AI-Face dataset and benchmark code are publicly available at https://github.com/Purdue-M2/AI-Face-FairnessBench.
1 Introduction
AI-generated faces are created with sophisticated AI technologies and are visually difficult to distinguish from real ones [1]. They fall into three categories: faces from deepfake videos [2], typically created with Variational Autoencoders (VAEs) [3, 4]; faces generated by Generative Adversarial Networks (GANs) [5, 6, 7, 8]; and faces generated by Diffusion Models (DMs) [9]. These technologies have significantly advanced the realism and controllability of synthetic facial representations. Generated faces can enrich media and increase creativity [10]. However, they also carry significant risks of misuse. For example, during the 2024 United States presidential election, fake images of Donald Trump surrounded by groups of smiling and laughing Black people spread online to encourage African Americans to vote Republican [11]. Such content can distort public opinion and erode people’s trust in media [12, 13], which makes detecting AI-generated faces necessary for their ethical use.
However, one major issue in current AI face detectors [14, 15, 16, 17] is biased detection (i.e., unfair detection performance among demographic groups [18, 19, 20, 21]). Biases can be mitigated with algorithmic fairness methods, but these usually require demographically annotated face datasets for model training. For example, works like [20, 21] have made efforts to enhance detection fairness based on A-FF++ [19] and A-DFD [19]. However, both datasets contain only faces from deepfake videos, so models trained on them may not detect faces generated by GANs and DMs fairly. Although a few datasets (e.g., GenData [22]) cover GAN and DM faces, their demographic annotations are not comprehensive. Most importantly, no existing dataset is diverse enough in generation methods to develop AI face detectors that can cope with rapidly evolving generative models. These limitations of existing datasets hamper the development of fair technologies for detecting AI-generated faces.
Moreover, benchmarking fairness provides a direct method to uncover prevalent and unique fairness issues in recent AI-generated face detection. However, there is a lack of a comprehensive benchmark to estimate the fairness of existing AI face detectors. Existing benchmarks [23, 24, 25, 26] primarily assess utility, neglecting systematic fairness evaluation. One study [18] does evaluate fairness in detection models, but their examination is only based on deepfake video datasets using a few outdated detectors. Detectors’ fairness performance on GAN faces and DM faces has not been extensively explored. The absence of a comprehensive fairness benchmark impedes a thorough understanding of the fairness behaviors of recent AI face detectors and obscures the research path for detector fairness guarantees.
In this work, we build the first million-scale demographically annotated AI-generated face image dataset: AI-Face (see Fig. 1). The face images are collected from various public datasets, including the real faces that are usually used to train AI face generators, faces from deepfake videos, and faces generated by GANs and DMs. Each face is demographically annotated with an uncertainty score on each predicted demographic attribute by our designed Contrastive Language-Image Pretraining (CLIP) [27]-based lightweight annotator. To improve the quality of annotations, we recruit three humans to correct annotations with high uncertainty scores manually. Next, we conduct the first comprehensive fairness benchmark on our dataset to estimate the fairness performance of 12 representative detectors coming from four model types. Our benchmark exposes common and unique fairness challenges in recent AI face detectors, providing essential insights that can guide and enhance the future design of fair AI face detectors. Our contributions are as follows:
• We build the first comprehensive million-scale demographically annotated AI-generated face dataset by leveraging our lightweight annotator with human correction.
• We conduct the first comprehensive fairness benchmark of AI-generated face detectors, providing an extensive fairness assessment of current representative detectors.
• Based on our experiments and observations, we summarize the unsolved questions and offer valuable insights within this research domain, setting the stage for future investigations.
2 Background and Motivation
AI-generated Faces and Biased Detection. AI-generated face images, created by advanced AI technologies, are visually difficult to distinguish from real ones (see Fig. 1). They fall into three categories: 1) Deepfake Videos. Initiated in 2017 [13], these use face-swapping techniques with a variational autoencoder to replace a face in a target video with one from a source [3, 4]. Note that our paper focuses solely on images extracted from videos. 2) GAN-generated Faces. Since 2017, Generative Adversarial Networks (GANs) [28] like StyleGANs [6, 7, 8] have significantly improved the realism of generated faces. 3) DM-generated Faces. Diffusion models (DMs), emerging in 2021, generate detailed faces from textual descriptions and offer greater controllability. Tools like Midjourney [29] and DALLE2 [30] facilitate customized face generation. While these AI-generated faces can enhance visual media and creativity [10], they also pose risks, such as being misused in social media profiles [31, 32]. Therefore, numerous studies focus on detecting AI-generated faces [14, 15, 16, 17], but current detectors often show performance disparities across demographic attributes such as race and gender [18, 19, 20, 21]. This bias can lead to unfair targeting or exclusion, undermining trust in detection models. Recent efforts [20, 21] aim to enhance fairness in deepfake detection but mainly address deepfake videos, overlooking biases in detecting GAN- and DM-generated faces.
Dataset | Face Images | Generation Category | #Generation Methods | Source of Real Images | Demographic Annotation | ||||||||
#Real | #Fake | Deepfake Videos | GAN | DM | Gender | Race | Age | ||||||
A-FF++ [19] | 29.8K | 149.1K | ✓ | 5 | YouTube | ✓ | ✓ | ✓ | |||||
A-DFD [19] | 10.8K | 89.6K | ✓ | 5 | Self-Recording | ✓ | ✓ | ✓ | |||||
A-DFDC [19] | 54.5K | 52.6K | ✓ | ✓ | 8 | Self-Recording | ✓ | ✓ | ✓ | ||||
A-Celeb-DF-v2 [19] | 26.3K | 166.5K | ✓ | 1 | Self-Recording | ✓ | ✓ | ||||||
A-DF-1.0 [19] | 870.3K | 321.5K | ✓ | 1 | Self-Recording | ✓ | ✓ | ✓ | |||||
DF-1.0 [33] | 2.9M | 14.7M | ✓ | 1 | Self-Recording | ✓ | ✓ | ||||||
DeePhy [34] | 1K | 50.4K | ✓ | ✓ | 3 | YouTube | ✓ | ✓ | |||||
DF-Platter [35] | 392.3K | 653.4K | ✓ | ✓ | 3 | YouTube | ✓ | ✓ | |||||
GenData [22] | - | 20K | ✓ | ✓ | 3 | CelebA [36] | ✓ | ||||||
Ours | 866K | 1.2M | ✓ | ✓ | ✓ | 37 | | ✓ | ✓ | ✓ |
The Related Existing Datasets. Current AI-generated facial datasets with demographic annotations are limited in size, generation categories, methods, and annotations, as illustrated in Table 1. For instance, A-FF++, A-DFD, A-DFDC, and A-Celeb-DF-v2 [19] are deepfake video datasets with fewer than one million images. Datasets like DF-1.0 [33] and DF-Platter [35] lack comprehensive demographic annotations. Additionally, existing datasets offer limited generation methods. These limitations hinder the development of fairer AI face detectors, motivating us to build a million-scale demographically annotated AI-Face dataset.
Existing Benchmarks | Category | Scope of Benchmark | |||
Deepfake Videos | GAN | DM | Utility | Fairness | |
DeepfakeBench [25] | ✓ | ✓ | ✓ | ||
Lin et al. [24] | ✓ | ✓ | ✓ | ||
Le et al. [26] | ✓ | ✓ | ✓ | ||
CDDB [23] | ✓ | ✓ | |||
Loc et al. [18] | ✓ | ✓ | ✓ | ||
Ours | ✓ | ✓ | ✓ | ✓ | ✓ |
Benchmark for Detecting AI-generated Faces. Benchmarks are essential for evaluating AI-generated face detectors under standardized conditions. Existing benchmarks, as shown in Table 2, mainly focus on detectors’ utility, often overlooking fairness [23, 24, 25, 26]. Only Loc et al. [18] examined detector fairness. However, their study focused only on deepfake video datasets, not on GAN- and DM-generated faces. This motivates us to conduct a comprehensive benchmark to evaluate AI face detectors’ fairness.
3 The Demographically Annotated AI-Face Dataset
To avoid the prohibitive time cost of manual annotation, we build our dataset in two phases: Annotator Development and Demographic Annotation Generation, as shown in Fig. 2.
3.1 Phase 1: Annotator Development
Problem Definition. Online software (e.g., Face++ [42]) and open-source tools (e.g., InsightFace [43]) already exist for face attribute prediction. However, they fall short of our task for two reasons: 1) They are mostly designed for face recognition and trained on datasets of real face images, so they lack the generalization capability needed to annotate AI-generated face images. 2) They do not provide uncertainty scores for their predictions, which are needed to identify mispredicted samples for further annotation correction. We are given a training dataset $\mathcal{D} = \{(X_i, G_i, A_i, R_i)\}_{i=1}^{n}$ of size $n$, where $X_i$, $G_i$, $A_i$, and $R_i$ represent the $i$-th face image and its gender, age, and race labels/attributes, respectively. Our goal is to design a lightweight, generalizable annotator based on $\mathcal{D}$ that predicts facial demographic attributes with uncertainty scores for each face image in our dataset.
Annotator. Architecture: We utilize CLIP [27] for its strong zero-shot and few-shot learning capabilities. Leveraging CLIP’s pre-training on diverse datasets, we create a lightweight annotator for facial images. Our annotator employs a frozen pre-trained CLIP ViT-L/14 [44] as a feature extractor $E(\cdot)$, followed by a trainable 3-layer Multilayer Perceptron (MLP) as a multi-task (i.e., gender, age, and race prediction) classifier parameterized by $\theta$. Loss: For each image $X_i$, its feature $f_i$ is obtained through $f_i = E(X_i)$ and is then fed into the MLP multi-task classifier, which is trained with conventional classification losses for face attribute prediction. The learning objective is formulated as: $\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \mathrm{CE}(h_G(f_i), G_i) + \mathrm{CE}(h_A(f_i), A_i) + \mathrm{CE}(h_R(f_i), R_i) \right]$, where $\mathrm{CE}(\cdot, \cdot)$ represents the (binary) cross-entropy (CE) loss, and $h_G$, $h_A$, and $h_R$ represent the classification heads for gender, age, and race, respectively. Optimization: Traditional optimization methods like stochastic gradient descent can lead to poor model generalization due to sharp loss landscapes with multiple local and global minima. To address this, we use Sharpness-Aware Minimization (SAM) [45] to enhance our annotator’s generalization by flattening the loss landscape. Specifically, flattening is attained by determining the optimal perturbation $\epsilon^*$ of the model parameters $\theta$ that maximizes the loss, formulated as: $\epsilon^* = \arg\max_{\|\epsilon\| \le \gamma} \mathcal{L}(\theta + \epsilon) \approx \arg\max_{\|\epsilon\| \le \gamma} \epsilon^{\top} \nabla_\theta \mathcal{L} = \gamma \, \mathrm{sign}(\nabla_\theta \mathcal{L})$, where $\gamma$ controls the perturbation magnitude. The approximation uses a first-order Taylor expansion, assuming $\epsilon$ is small. The final equality is obtained by solving a dual norm problem, where sign represents the sign function and $\nabla_\theta \mathcal{L}$ is the gradient of $\mathcal{L}$ with respect to $\theta$. As a result, the model parameters are updated by solving: $\min_\theta \mathcal{L}(\theta + \epsilon^*)$.
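The SAM update described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' training code: the toy quadratic loss stands in for the annotator's multi-task loss, and the names `sam_step`, `gamma`, and `lr` as well as their default values are assumptions made for illustration. The perturbation uses the sign-based closed form mentioned in the text.

```python
def sign(x):
    """Sign function: returns -1, 0, or +1."""
    return (x > 0) - (x < 0)

def sam_step(theta, grad_fn, gamma=0.05, lr=0.1):
    """One SAM update on a flat list of parameters (illustrative sketch).

    grad_fn(theta) must return the gradient of the loss at theta.
    gamma is the perturbation magnitude, lr the learning rate (both assumed).
    """
    # 1) Gradient at the current parameters.
    g = grad_fn(theta)
    # 2) First-order worst-case perturbation: eps* = gamma * sign(gradient).
    eps = [gamma * sign(gi) for gi in g]
    # 3) Gradient evaluated at the perturbed point theta + eps*.
    g_perturbed = grad_fn([t + e for t, e in zip(theta, eps)])
    # 4) Descend from the original parameters using the perturbed gradient.
    return [t - lr * gp for t, gp in zip(theta, g_perturbed)]

# Toy stand-in for the annotator loss: L(theta) = sum of squares.
def toy_loss(theta):
    return sum(t * t for t in theta)

def toy_grad(theta):
    return [2.0 * t for t in theta]

theta = [1.0, -2.0]
for _ in range(50):
    theta = sam_step(theta, toy_grad)
```

In a real setting `grad_fn` would come from automatic differentiation over the MLP heads, with the frozen CLIP features as inputs; only the two-gradient structure of the step carries over from this sketch.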
Uncertainty Estimation. Although our annotator achieves high prediction performance, labels may still be mispredicted because some face images are ambiguous (see an example in Fig. 3). Therefore, it is crucial to provide an uncertainty score for each prediction from the annotator. To this end, inspired by [46], we apply dropout at each layer of the MLP for uncertainty estimation at test time. This involves performing $T$ stochastic forward passes for a given test image $X$, each with a unique dropout pattern, which yields $T$ distinct softmax outputs for each demographic attribute $a$. The uncertainty score for $a$ on image $X$, denoted $U_a(X)$, combines two terms computed from these outputs: a measure of centrality (the first term, which indicates the likelihood of the prediction being correct) and a measure of dispersion (the second term, which reflects the consensus among the stochastic outputs). A user-defined parameter $\lambda$ balances the two terms.
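As a rough illustration of a dropout-based uncertainty score, the sketch below combines a centrality term and a dispersion term over several stochastic softmax outputs. The exact formula used by the authors is not reproduced here; the particular combination (one minus the mean predicted-class probability, plus a weighted standard deviation), the function name `uncertainty_score`, and the weight `lam` are all assumptions.

```python
from statistics import mean, pstdev

def uncertainty_score(softmax_outputs, lam=1.0):
    """Score one attribute prediction from several stochastic (dropout) passes.

    softmax_outputs: list of probability vectors, one per forward pass.
    lam: assumed weight that balances centrality against dispersion.
    Returns (predicted_class, score); a higher score means more uncertain.
    """
    num_classes = len(softmax_outputs[0])
    # Average the softmax vectors and take the arg-max as the prediction.
    avg = [mean(p[c] for p in softmax_outputs) for c in range(num_classes)]
    pred = max(range(num_classes), key=lambda c: avg[c])
    # Probability assigned to the predicted class in each pass.
    probs = [p[pred] for p in softmax_outputs]
    centrality = 1.0 - mean(probs)  # low average confidence raises the score
    dispersion = pstdev(probs)      # disagreement across passes raises the score
    return pred, centrality + lam * dispersion

# Two passes that disagree on how confident the model is.
pred, score = uncertainty_score([[0.9, 0.1], [0.5, 0.5]])
```

Under this assumed form, a confident and stable prediction scores close to zero, while a prediction whose confidence swings between passes is pushed toward the review queue.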
Evaluation. To demonstrate our annotator’s effectiveness, we answer the following questions: Q1: How do the general performance and generalization capability of our annotator compare with the baselines? Q2: How does sample difficulty affect the annotator’s performance? Leveraging CLIP’s good generalization capability, our annotator is trained on the VGGFace2 [47] dataset, which contains 3.3M images of 9.1K individuals. More importantly, [48] provides comprehensive demographic annotations for this dataset. We compare our annotator with the current state-of-the-art face attribute prediction tools Face++ [42] and InsightFace [43]. Since they do not offer predictions for the race attribute, our evaluation is confined to gender and age. The mean and standard deviation are reported based on 5 random runs. More details are in Appendix A.1.1.
For Q1, Setting: We perform intra-domain (train on VGGFace2, test on its official test set) and cross-domain (train on VGGFace2, test on four AI-generated face datasets) evaluations. Specifically, A-FF++, A-DFDC, A-DFD, and A-Celeb-DF-v2 are selected from [19] for cross-domain evaluation. Since A-DFD and A-Celeb-DF-v2 have limited age and race annotations, our evaluation of these two is confined to gender. These datasets are chosen because they closely match our objective and are not used to train Face++ and InsightFace. Results: The ‘All’ results in Table 3 demonstrate our annotator’s superiority in general performance and generalization capability against Face++ and InsightFace. Under intra-domain evaluation, it surpasses the second-best method, Face++, by 5.8% on gender and 18.9% on age. In cross-domain evaluation, our annotator maintains high accuracy on all datasets, reflecting good generalization. Remarkably, on the A-FF++ dataset, our annotator outperforms Face++ by a substantial margin of up to 11.4% and InsightFace by 16.1% on age.
Accuracy (%) reported as mean (standard deviation) over 5 random runs, by difficulty level.
Type | Dataset | Attribute | All | | | Easy | | | Medium | | | Hard | |
 | | | InsightFace [43] | Face++ [42] | Ours | InsightFace [43] | Face++ [42] | Ours | InsightFace [43] | Face++ [42] | Ours | InsightFace [43] | Face++ [42] | Ours
Intra-Domain | VGGFace2 [47] | Gender | 76.7289 (0.4985) | 78.0764 (0.4266) | 83.8978 (0.3697) | 97.0133 (0.1293) | 97.0863 (0.3414) | 99.7333 (0.1265) | 74.2400 (0.8182) | 75.356 (0.5938) | 87.5333 (0.5007) | 58.9333 (0.5481) | 61.787 (0.3445) | 64.4267 (0.4818)
Intra-Domain | VGGFace2 [47] | Age | 54.4311 (0.7443) | 58.4889 (0.7341) | 77.4044 (0.6714) | 68.000 (0.5530) | 73.0067 (0.6534) | 98.4133 (0.1543) | 49.8000 (0.6613) | 53.2467 (0.8465) | 78.9733 (1.0771) | 45.4933 (1.0186) | 49.2134 (0.7025) | 54.8267 (0.7827)
Cross-Domain | A-FF++ [19] | Gender | 84.9733 (0.4651) | 89.1714 (0.1974) | 91.3000 (0.2058) | 96.8267 (0.3832) | 98.0528 (0.1483) | 98.9333 (0.0943) | 88.0933 (0.4668) | 94.3074 (0.2586) | 98.8667 (0.1033) | 70.0000 (0.5452) | 75.1539 (0.1854) | 76.1000 (0.4197)
Cross-Domain | A-FF++ [19] | Age | 59.4867 (0.9291) | 64.1893 (0.7609) | 75.5393 (0.5130) | 71.254 (0.5973) | 80.5980 (0.4140) | 93.1980 (0.3110) | 58.1720 (0.4489) | 63.8340 (0.6733) | 81.1960 (0.4702) | 49.0340 (1.7410) | 48.1360 (1.1954) | 52.2240 (0.7577)
Cross-Domain | A-DFDC [19] | Gender | 70.1111 (0.5037) | 76.0917 (0.6290) | 78.2922 (0.5178) | 85.8533 (0.5239) | 92.1414 (0.5447) | 96.2933 (0.5927) | 68.4666 (0.5667) | 73.5088 (0.4910) | 76.2005 (0.5028) | 56.0133 (0.5014) | 62.6249 (0.8513) | 62.3334 (0.4580)
Cross-Domain | A-DFDC [19] | Age | 66.6967 (0.8015) | 69.5907 (0.5687) | 77.1800 (0.6300) | 72.1580 (0.6785) | 84.5000 (0.4908) | 95.3820 (0.3247) | 64.398 (0.9182) | 64.238 (0.5423) | 73.1620 (0.8592) | 63.5340 (0.8078) | 60.034 (0.6730) | 62.9960 (0.7061)
Cross-Domain | A-DFD [19] | Gender | 66.7156 (0.7681) | 70.7983 (1.2229) | 74.9822 (0.6029) | 85.5467 (0.6791) | 88.9791 (0.8297) | 94.0400 (0.2999) | 60.5467 (0.9017) | 64.2144 (1.3436) | 70.0267 (0.9471) | 54.0533 (0.7235) | 59.2015 (1.4953) | 60.8800 (0.5616)
Cross-Domain | A-Celeb-DF-v2 [19] | Gender | 91.9244 (0.3003) | 90.8100 (0.4487) | 95.1489 (0.4088) | 98.9733 (0.1769) | 98.1867 (0.2286) | 99.9867 (0.0267) | 94.0000 (0.3239) | 94.4933 (0.5052) | 99.7600 (0.0998) | 82.8000 (0.4000) | 79.7500 (0.6124) | 85.7000 (1.1000)
For Q2, Setting: We also design a stratified evaluation that separates each test dataset into three subsets (Easy, Medium, and Hard) based on the estimated uncertainty scores. Specifically, for each demographic attribute $a$, we define two thresholds $\tau_a^{\mathrm{low}}$ and $\tau_a^{\mathrm{high}}$, where $\tau_a^{\mathrm{low}} < \tau_a^{\mathrm{high}}$ (more details are in Appendix A.1.2). Each image is then assigned to the Easy, Medium, or Hard subset according to where its uncertainty score falls relative to these two thresholds, with more uncertain images treated as harder. Next, we sample 1,500 images from each subset. This stratification is crucial for a thorough examination of the model’s performance across a broad spectrum of data challenges. To avoid attribute-specific biases, each subset is balanced with respect to the attribute. Results: Table 3 illustrates that while all methods show decreased accuracy as the sample difficulty level increases, our annotator demonstrates greater resilience. For example, under intra-domain evaluation, our annotator’s gender accuracy drops by 12.2% from easy to medium difficulty, compared to Face++’s 21.7% drop. In the cross-domain scenario, our annotator experiences a 14.3% reduction on gender in A-Celeb-DF-v2 [19] from easy to hard, versus InsightFace’s 16.2%.
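The stratified evaluation can be sketched as a simple two-threshold split followed by fixed-size sampling. This is an illustrative sketch only: the function names, the boundary handling (strictly below the lower threshold is Easy, at or above the upper threshold is Hard), and the example threshold values are assumptions, and the attribute balancing mentioned above is omitted for brevity.

```python
import random

def stratify(scores, tau_low, tau_high):
    """Split image ids into easy / medium / hard by uncertainty score.

    scores: dict mapping image id -> uncertainty score for one attribute.
    tau_low, tau_high: the two thresholds, with tau_low < tau_high.
    """
    if not tau_low < tau_high:
        raise ValueError("tau_low must be smaller than tau_high")
    subsets = {"easy": [], "medium": [], "hard": []}
    for image_id, u in scores.items():
        if u < tau_low:
            subsets["easy"].append(image_id)
        elif u < tau_high:
            subsets["medium"].append(image_id)
        else:
            subsets["hard"].append(image_id)
    return subsets

def sample_subset(ids, k=1500, seed=0):
    """Draw up to k ids without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))

# Hypothetical scores and thresholds, for illustration only.
subsets = stratify({"a": 0.05, "b": 0.30, "c": 0.90}, tau_low=0.1, tau_high=0.5)
```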
3.2 Phase 2: Demographic Annotation Generation
Data Collection. We build our AI-Face dataset by collecting and integrating public AI-generated face images sourced from academic publications, GitHub repositories, and commercial tools. More details are in Appendix A.2.1. Specifically, the fake face images in our dataset originate from 4 deepfake video datasets (i.e., A-FF++ [19], A-DFDC [19], A-DFD [19], and A-Celeb-DF-v2 [19]), 10 GAN models (i.e., AttGAN [49], MMDGAN [50], StarGAN [49], StyleGANs [49, 51, 52], MSGGAN [50], ProGAN [53], STGAN [50], and VQGAN [54]), and 8 DM models (i.e., DALLE2 [55], IF [55], Midjourney [55], DCFace [56], Latent Diffusion [57], Palette [58], Stable Diffusion v1.5 [59], and Stable Diffusion Inpainting [59]). This constitutes a total of 1,245,660 fake face images in our dataset. These fake images are generated from 8 real source datasets (i.e., FFHQ [6], CASIA-WebFace [37], IMDB-WIKI [38], CelebA [36], and real images from FF++ [2], DFDC [39], DFD [40], and Celeb-DF-v2 [41]), which contribute a total of 866,096 real face images to our dataset. Overall, our dataset contains 30 subsets and 37 generation methods (i.e., 5 in A-FF++, 5 in A-DFD, 8 in A-DFDC, 1 in A-Celeb-DF-v2, 10 GANs, and 8 DMs). We use RetinaFace [60] to detect and crop faces so that each image contains only one face.
Annotator Prediction. For our collected images, the annotator from Phase 1 predicts each demographic attribute together with an uncertainty score. Annotation generation is iterative, as shown in Fig. 2.
Human Correction. As described in ‘Uncertainty Estimation’ in Section 3.1, the annotator may mispredict ambiguous face images, necessitating human review and correction. To this end, we propose two annotation correction strategies: 1) For subsets that have the same images and demographic attribute classes as those in existing datasets, such as A-FF++ [19] and A-DFDC [19], we filter out images that may need human correction based on annotation inconsistency.
2) For the rest of the subsets, we identify the most ambiguous images that need human correction based on uncertainty scores. Specifically, for demographic attribute $a$ on subset $s$, we define a specific threshold $\tau_{a,s}$ (more details are in Appendix A.2.2). If the uncertainty score of an image for attribute $a$ exceeds $\tau_{a,s}$, the annotation for attribute $a$ of that image undergoes a verification process, potentially requiring human re-annotation (see Fig. 3). In practice, we recruit three human annotators to correct the filtered images and consolidate their evaluations by majority vote to finalize the annotations.
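The second correction strategy, flagging by threshold and then consolidating three human labels by majority vote, can be sketched as follows. The helper names are hypothetical, and the handling of a three-way disagreement (returning `None` so the image can be escalated) is an assumption, since the text does not say how such cases are resolved.

```python
from collections import Counter

def needs_review(score, tau):
    """Flag an annotation whose uncertainty score reaches the subset threshold."""
    return score >= tau

def majority_vote(labels):
    """Return the label chosen by a strict majority of annotators.

    Returns None when no label has a strict majority (for example, three
    annotators giving three different labels); this fallback is an assumption.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Hypothetical example: one flagged image, three human labels for gender.
flagged = needs_review(score=0.62, tau=0.5)
final_label = majority_vote(["Female", "Female", "Male"])
```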
Evaluation. To estimate our dataset’s quality, we answer the following questions: Q1: Can we directly incorporate the existing annotations into our dataset? Q2: How effective is human correction? Q3: What is the overall annotation quality of our dataset?
Type | Dataset | Gender ACC(%) | Gender Precision(%) | Gender Recall(%) | Age ACC(%) | Age Precision(%) | Age Recall(%) | Race ACC(%) | Race Precision(%) | Race Recall(%)
For Q1 | A-FF++ [19] | 8.0163 | 17.3354 | 5.8314 | 19.9002 | 30.6658 | 29.6071 | 28.7865 | 35.7122 | 41.1687
For Q1 | Ours-FF++ (w/o Correction) | 91.9837 | 82.6646 | 94.1684 | 21.1830 | 32.1232 | 45.7231 | 45.9775 | 50.3803 | 40.1949
For Q1 | A-DFDC [19] | 20.2252 | 27.5332 | 21.6538 | 16.7493 | 29.0640 | 29.5519 | 18.1115 | 15.1092 | 22.0637
For Q1 | Ours-DFDC (w/o Correction) | 79.7748 | 72.4668 | 78.3462 | 45.9748 | 49.4734 | 48.7861 | 70.9001 | 64.7655 | 65.1608
For Q2 | Ours (w/o Correction) | 83.4167 | 83.4167 | 83.4242 | 43.8333 | 43.8333 | 54.1792 | 67.4167 | 65.0718 | 59.2350
For Q2 | Ours | 84.8333 | 84.8738 | 84.8599 | 44.7500 | 44.0937 | 54.6033 | 68.8333 | 66.6440 | 61.3225
For Q3 | Ours | 98.6667 | 98.6688 | 98.6667 | 56.2500 | 50.1748 | 53.0514 | 86.2500 | 75.5216 | 67.4076
For Q1, Setting: We compare our dataset’s annotation quality before human correction on A-FF++ (i.e., Ours-FF++ (w/o Correction)) and A-DFDC (i.e., Ours-DFDC (w/o Correction)) against the existing annotations from [19]. We regard human re-labeled annotations as the ground truth. Results: The results in Table 4 ‘For Q1’ show the superior annotation accuracy of our dataset. For example, Ours-FF++ (w/o Correction) surpasses A-FF++ by 83.97% in gender accuracy, and Ours-DFDC (w/o Correction) exceeds A-DFDC by 59.55%. This large performance gap indicates that the images identified through annotation inconsistency are mislabeled in A-FF++ [19] and A-DFDC [19], so the existing annotations cannot be directly merged into our dataset. Some examples are shown in Appendix A.2.3.
For Q2, Setting: We consider two dataset versions: 1) Ours (w/o Correction), where annotations are not corrected by humans. 2) Ours, where annotations are corrected by humans. With the help of the uncertainty score, we sample 1,200 attribute-balanced images (400 easy, 400 medium, and 400 hard) from the whole dataset to ensure a fair evaluation. Three humans re-annotated these images to establish ground truth. Results: Table 4 ‘For Q2’ shows that human corrections improve performance across all attributes, increasing accuracy by 1.42% for gender, 0.92% for age, and 1.42% for race, which validates the effectiveness of our correction strategy. More results are in Appendix A.2.4.
For Q3, Setting: We randomly sample 1,200 images from the whole dataset. Three humans also re-annotated these images to create ground truth. Results: The ‘For Q3’ row of Table 4 reflects the approximate overall annotation quality of our dataset. Notably, the gender and race annotations show high correctness (98.6667% ACC on gender and 86.2500% ACC on race). However, the age annotation shows lower accuracy, since age groups are challenging to differentiate visually.
4 Fairness Benchmark Experiments
In this section, we estimate the existing AI-generated image detectors’ fairness performance alongside their utility on our AI-Face Dataset (80%/20% for Train/Test). Our goal is to show the significance of our dataset and expose the fairness issues of recent detectors in combating AI-generated faces.
Detection Methods. Our benchmark implements 12 detectors, as detailed in Appendix B.1. They cover methods tailored to detecting AI-generated faces from deepfake videos, GANs, and DMs, and can be classified into four types. Naive detectors are backbone models that can be directly used as binary classifiers, including CNN-based models (i.e., Xception [61] and EfficientB4 [62]) and a transformer-based model (i.e., ViT-B/16 [63]). Frequency-based detectors explore the frequency domain for forgery detection (i.e., F3Net [64], SPSL [65], and SRM [66]). Spatial-based detectors focus on mining spatial characteristics (e.g., texture) within images (i.e., UCF [16], UnivFD [67], and CORE [68]). Fairness-enhanced detectors focus on improving fairness in AI-generated face detection by designing specific algorithms (i.e., DAW-FDD [20], DAG-FDD [20], and PG-FDD [21]). Implementation and training details are in Appendix B.2.
Model Type | | | Naive | | | Frequency | | | Spatial | | | Fairness-enhanced | |
Measure | Attribute | Metric | Xception [61] | EfficientB4 [62] | ViT-B/16 [63] | F3Net [64] | SPSL [65] | SRM [66] | UCF [16] | UnivFD [67] | CORE [68] | DAW-FDD [20] | DAG-FDD [20] | PG-FDD [21]
Fairness↓ | Gender | | 0.387 | 1.176 | 0.187 | 0.279 | 0.454 | 0.533 | 0.305 | 0.458 | 1.635 | 0.404 | 0.272 | 0.236
Fairness↓ | Gender | | 2.843 | 2.052 | 2.489 | 2.941 | 2.998 | 2.433 | 2.890 | 2.456 | 1.977 | 2.979 | 2.799 | 2.614
Fairness↓ | Gender | | 0.271 | 0.595 | 0.422 | 0.086 | 0.188 | 0.268 | 0.169 | 0.557 | 0.977 | 0.123 | 0.192 | 0.134
Fairness↓ | Gender | | 0.439 | 1.229 | 0.235 | 0.552 | 0.577 | 0.536 | 0.346 | 0.490 | 1.846 | 0.699 | 0.407 | 0.237
Fairness↓ | Race | | 4.386 | 8.307 | 13.078 | 3.098 | 4.736 | 5.470 | 3.188 | 14.663 | 16.001 | 3.461 | 3.344 | 1.956
Fairness↓ | Race | | 18.248 | 19.691 | 18.446 | 18.282 | 18.822 | 16.182 | 18.770 | 23.542 | 24.163 | 18.306 | 18.288 | 18.040
Fairness↓ | Race | | 3.509 | 5.659 | 5.351 | 2.217 | 2.201 | 4.044 | 1.847 | 6.505 | 5.105 | 1.365 | 1.847 | 1.132
Fairness↓ | Race | | 10.863 | 19.921 | 24.002 | 7.052 | 7.282 | 11.602 | 6.311 | 30.947 | 24.015 | 6.948 | 6.439 | 4.039
Fairness↓ | Age | | 1.695 | 3.028 | 8.931 | 1.319 | 1.025 | 1.090 | 0.854 | 5.818 | 6.964 | 2.838 | 0.809 | 0.781
Fairness↓ | Age | | 6.242 | 6.724 | 6.264 | 6.357 | 6.340 | 5.905 | 6.257 | 6.260 | 5.030 | 6.249 | 6.140 | 6.098
Fairness↓ | Age | | 1.028 | 1.619 | 3.948 | 1.017 | 0.710 | 0.934 | 0.635 | 4.966 | 3.652 | 2.610 | 0.606 | 0.506
Fairness↓ | Age | | 4.116 | 6.080 | 12.888 | 3.696 | 2.827 | 3.116 | 2.479 | 15.252 | 8.382 | 7.361 | 2.171 | 1.587
Fairness↓ | Intersection | | 7.113 | 9.999 | 14.667 | 4.739 | 7.320 | 9.731 | 4.606 | 17.606 | 19.303 | 5.316 | 4.708 | 2.604
Fairness↓ | Intersection | | 20.675 | 20.963 | 20.114 | 20.492 | 20.242 | 19.112 | 20.704 | 24.366 | 25.892 | 20.373 | 19.940 | 20.402
Fairness↓ | Intersection | | 6.174 | 9.181 | 7.711 | 3.692 | 3.744 | 6.498 | 3.061 | 11.802 | 6.035 | 2.641 | 3.174 | 1.830
Fairness↓ | Intersection | | 24.520 | 42.330 | 49.075 | 16.699 | 16.257 | 25.983 | 13.932 | 68.449 | 47.016 | 14.539 | 14.118 | 8.618
Fairness↓ | Individual | | 112.067 | 585.935 | 0.125 | 46.083 | 22.982 | 1.383 | 3.246 | 8.606 | 0.598 | 28.437 | 13.706 | 0.477
Fairness↓ | Avg. Fairness Rank (per detector) | | 6.824 | 9.529 | 7.706 | 5.941 | 6.647 | 5.471 | 4.353 | 9.941 | 9.235 | 6.118 | 3.765 | 2.059
Fairness↓ | Avg. Fairness Rank (per model type) | | 8.020 | | | 6.020 | | | 7.843 | | | 3.981 | |
Utility↑ | - | ACC(%) | 97.639 | 95.404 | 93.719 | 98.229 | 98.274 | 97.978 | 98.635 | 90.229 | 96.087 | 97.316 | 98.543 | 99.079
Utility↑ | - | AUC(%) | 99.768 | 99.117 | 98.914 | 99.826 | 99.786 | 99.767 | 99.885 | 96.030 | 98.846 | 99.703 | 99.871 | 99.937
Utility↑ | - | AP(%) | 99.846 | 99.359 | 99.240 | 99.885 | 99.853 | 99.829 | 99.917 | 96.973 | 98.987 | 99.802 | 99.916 | 99.956
Utility↑ | - | EER(%) | 2.388 | 4.794 | 5.829 | 1.741 | 1.610 | 2.134 | 1.365 | 10.680 | 4.656 | 2.701 | 1.365 | 1.212
Training Time / Epoch | | | 1h35min | 3h07min | 3h26min | 1h41min | 1h37min | 4h05min | 5h10min | 5h07min | 1h36min | 1h45min | 1h38min | 7h45min
Evaluation Metrics. To provide a comprehensive benchmark, we consider 5 fairness metrics commonly used in the fairness community [69, 70, 71, 72, 73] and 4 widely used utility metrics. For fairness, we use Demographic Parity ($F_{DP}$) [69, 70], Max Equalized Odds ($F_{MEO}$) [72], Equal Odds ($F_{EO}$) [71], and Overall Accuracy Equality ($F_{OAE}$) [72] to evaluate group fairness (e.g., across genders) and intersectional fairness (e.g., across subgroups defined jointly by a specific race and a specific gender). We also use individual fairness ($F_{IND}$) [73, 74], which requires that similar individuals receive similar predicted outcomes. The definitions of the fairness metrics are in Appendix B.3. To compare detectors’ performance clearly and fairly, we define the Average Fairness Rank, which ranks each detector on each fairness metric and averages these ranks. We also report the average of this rank across the methods within each model type. For utility metrics, we employ Accuracy (ACC), the Area Under the ROC Curve (AUC), Average Precision (AP), and Equal Error Rate (EER).
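To make the group fairness metrics more concrete, the sketch below computes per-group rates and reports the largest gap between groups. The paper's exact definitions are in Appendix B.3 and may differ; the max-minus-min gap used here is one common formulation and is an assumption, as are the function names and the label convention (1 = fake, 0 = real).

```python
def group_rates(y_true, y_pred, groups):
    """Per-group positive-prediction rate, accuracy, TPR, and FPR.

    Labels follow the assumed convention 1 = fake, 0 = real.
    """
    stats = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        pos = [i for i in idx if y_true[i] == 1]
        neg = [i for i in idx if y_true[i] == 0]
        stats[g] = {
            "pos_rate": sum(y_pred[i] for i in idx) / len(idx),
            "acc": sum(y_pred[i] == y_true[i] for i in idx) / len(idx),
            "tpr": sum(y_pred[i] for i in pos) / len(pos) if pos else 0.0,
            "fpr": sum(y_pred[i] for i in neg) / len(neg) if neg else 0.0,
        }
    return stats

def gap(stats, key):
    """Largest difference in one rate between any two groups."""
    values = [s[key] for s in stats.values()]
    return max(values) - min(values)

# Tiny hypothetical example with two gender groups.
stats = group_rates(
    y_true=[1, 0, 1, 0],
    y_pred=[1, 0, 1, 1],
    groups=["M", "M", "F", "F"],
)
dp_gap = gap(stats, "pos_rate")   # demographic-parity-style gap
oae_gap = gap(stats, "acc")       # overall-accuracy-equality-style gap
fpr_gap = gap(stats, "fpr")       # one ingredient of equalized-odds-style metrics
```

Intersectional fairness follows from the same code by passing joint labels such as "F-B" in `groups` instead of a single attribute.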
Results. Overall Performance. Table 5 reports the overall performance on our AI-Face test set. Our observations are: 1) Most detectors exhibit clear performance disparities across demographic groups; Fairness-enhanced detectors are the exception and show relatively lower disparities. 2) The top 3 methods according to the per-detector Average Fairness Rank are PG-FDD [21], DAG-FDD [20], and UCF [16]. 3) According to the average rank per model type, Fairness-enhanced detectors perform best, and Frequency detectors surpass both Spatial and Naive detectors. A possible reason is that frequency features focus more on forgery traces while weakening demographic features. This highlights a potential avenue for future research: enhancing detector fairness by integrating frequency features with fairness-enhanced algorithms. 4) 9 out of 12 detectors have an AUC higher than 99%, demonstrating that training on our AI-Face dataset yields detectors with high utility. 5) PG-FDD demonstrates superior performance but has a long training time, which can be addressed in future work.
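The rank aggregation behind these observations can be sketched in a few lines. The first function is an illustrative implementation of "rank each detector on each metric, then average"; its tie handling (ties broken by input order) is an assumption. The second part recomputes the per-model-type averages from the per-detector ranks in Table 5, assuming the twelve detector columns follow the order in which the detectors are listed in the text.

```python
def average_fairness_rank(metric_table):
    """Average rank of each detector over all fairness metrics (lower is fairer).

    metric_table: dict metric name -> dict detector name -> metric value.
    Ties are broken by input order, which is an assumption of this sketch.
    """
    detectors = list(next(iter(metric_table.values())))
    totals = {d: 0.0 for d in detectors}
    for values in metric_table.values():
        ordered = sorted(detectors, key=lambda d: values[d])
        for rank, d in enumerate(ordered, start=1):
            totals[d] += rank
    return {d: totals[d] / len(metric_table) for d in detectors}

# Per-detector average ranks copied from Table 5, grouped by model type.
per_detector_rank = {
    "Naive": [6.824, 9.529, 7.706],
    "Frequency": [5.941, 6.647, 5.471],
    "Spatial": [4.353, 9.941, 9.235],
    "Fairness-enhanced": [6.118, 3.765, 2.059],
}
per_type_rank = {
    model_type: round(sum(ranks) / len(ranks), 3)
    for model_type, ranks in per_detector_rank.items()
}
```

The recomputed per-type values agree with the corresponding row of Table 5 (8.020, 6.020, 7.843, and 3.981), which supports the ordering of model types stated in observation 3.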
Performance on Different Subsets. Fig. 4 demonstrates the intersectional and AUC performance of detectors on each test subset (e.g., subsets originate from different generative methods). We observe that the fairness performance varies a lot among different generative methods in every detector. The largest bias on most detectors comes from detecting face images generated by STGAN [75] and Commercial Tools (CT), including DALLE2 [55], IF [55], and Midjourney [55]. Moreover, the stable utility demonstrates our dataset’s expansiveness and diversity, enabling effective training to detect AI-generated faces from various generative methods. Full evaluation results are in Appendix B.4.
Performance on Different Subgroups. We analyze all detectors on intersectional subgroups: Male-White (M-W), Male-Black (M-B), Male-Asian (M-A), Male-Others (M-O), Female-White (F-W), Female-Black (F-B), Female-Asian (F-A), Female-Others (F-O). Fig. 5 plots the ratio of each subgroup’s FPR to that of a reference group (M-W). 1) Facial images of M-A, F-B, and F-A are more likely to be mistakenly detected as fake than facial images of M-W. 2) However, in DAW-FDD the FPR of M-W is higher than that of the other subgroups. This highlights a challenge in algorithmic fairness methods: improving performance for minority groups can inadvertently raise the error rate for the majority group (e.g., M-W). See demographic distribution in Appendix A.2.1.
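The per-subgroup FPR ratios discussed above can be illustrated with a short sketch. It assumes binary labels with 0 for real and 1 for fake, so that a false positive is a real face flagged as fake. The function names and the toy data are hypothetical and are not taken from the benchmark code.

```python
from collections import defaultdict


def fpr_by_group(labels, preds, groups):
    """False positive rate per subgroup.

    labels, preds: 0 = real, 1 = fake (assumed encoding).
    FPR = (real images predicted fake) / (all real images) in the subgroup.
    Subgroups without any real image are omitted.
    """
    false_pos = defaultdict(int)
    n_real = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        if y == 0:
            n_real[g] += 1
            if p == 1:
                false_pos[g] += 1
    return {g: false_pos[g] / n_real[g] for g in n_real}


def fpr_ratios(fprs, reference="M-W"):
    """Ratio of each subgroup's FPR to the reference subgroup's FPR."""
    ref = fprs[reference]
    if ref == 0:
        raise ValueError("reference FPR is zero; ratios are undefined")
    return {g: f / ref for g, f in fprs.items()}


# Toy data: four real images per subgroup plus one fake image each.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
groups = ["M-W"] * 4 + ["F-B"] * 4 + ["M-W", "F-B"]

fprs = fpr_by_group(labels, preds, groups)
ratios = fpr_ratios(fprs)
```

A ratio above 1 means real faces from that subgroup are flagged as fake more often than real faces from the reference subgroup.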
Fairness Robustness Evaluation. Images spread on public platforms usually undergo post-processing. Therefore, it is important to assess how well detectors preserve fairness when handling distorted images. We apply 6 post-processing methods: Random Crop (RC) [76], Rotation (RT) [25], Brightness Contrast (BC) [25], Hue Saturation Value (HSV) [25], Gaussian Blur (GB) [25], and JPEG Compression (JC) [77] to the test images (see Appendix B.5 for more details). Fig. 6 shows each detector’s intersectional and AUC performance changes after post-processing. Our observations are: 1) These impairments tend to wash out forensic traces, to the point that detectors suffer significant performance degradation. 2) Recent Fairness-enhanced detectors struggle to maintain fairness when images undergo post-processing. 3) Transformer-based models (i.e., ViT-B/16 [63] and UnivFD [67]) demonstrate stronger robustness than CNN-based models. 4) JPEG Compression and Gaussian Blur cause notably greater performance degradation than the other methods. See Appendix B.6 for more robustness analysis with respect to different degrees of post-processing.
Model Type | Detector | A-DF-1.0 [19] Fairness(%)↓ | A-DF-1.0 [19] Fairness(%)↓ | A-DF-1.0 [19] Utility(%)↑ AUC | DF-Platter [35] Fairness(%)↓ | DF-Platter [35] Fairness(%)↓ | DF-Platter [35] Utility(%)↑ AUC | GenData [22] Fairness(%)↓ | GenData [22] Fairness(%)↓ | GenData [22] Utility(%)↑ AUC | Avg-
Naive | Xception [61] | 4.227(+3.956) | 9.198(+8.759) | 82.479(-17.289) | 2.308(+2.037) | 8.691(+8.252) | 75.933(-23.835) | 0.438(+0.167) | 1.724(+1.285) | 94.315(-5.453) | 5.167
Naive | EfficientB4 [62] | 3.689(+3.094) | 17.017(+15.788) | 61.436(-37.681) | 4.459(+3.864) | 10.191(+8.962) | 63.871(-35.246) | 0.001(-0.594) | 3.621(+2.392) | 87.522(-11.595) | 8.000
Naive | ViT-B/16 [63] | 4.45(+4.028) | 9.154(+8.919) | 70.896(-28.018) | 2.531(+2.109) | 5.557(+5.322) | 68.935(-29.979) | 1.249(+0.827) | 2.874(+2.639) | 89.109(-9.805) | 6.667
Frequency | F3Net [64] | 1.749(+1.663) | 19.484(+18.932) | 86.265(-13.561) | 2.995(+2.909) | 5.445(+4.893) | 82.421(-17.405) | 0.155(+0.069) | 2.927(+2.375) | 93.882(-5.944) | 6.000
Frequency | SPSL [65] | 8.497(+8.309) | 2.430(+1.853) | 75.177(-24.609) | 3.323(+3.135) | 8.966(+8.389) | 82.024(-17.762) | 0.138(-0.050) | 2.321(+1.744) | 94.320(-5.466) | 6.167
Frequency | SRM [66] | 3.708(+3.440) | 1.169(+0.633) | 65.779(-33.988) | 4.976(+4.708) | 33.702(+33.166) | 72.777(-26.990) | 1.545(+1.277) | 2.378(+1.842) | 94.130(-5.637) | 8.000
Spatial | UCF [16] | 2.930(+2.761) | 9.924(+9.578) | 83.260(-16.625) | 3.536(+3.367) | 9.395(+9.049) | 83.92(-15.965) | 1.346(+1.177) | 1.377(+1.031) | 94.948(-4.937) | 6.500
Spatial | UnivFD [67] | 14.149(+13.592) | 1.833(+1.343) | 65.810(-30.220) | 7.686(+7.129) | 11.701(+11.211) | 69.483(-26.547) | 0.903(+0.346) | 2.227(+1.737) | 85.965(-10.065) | 8.167
Spatial | CORE [68] | 0.308(-0.669) | 11.854(+10.008) | 79.222(-19.624) | 3.966(+2.989) | 5.267(+3.421) | 81.264(-17.582) | 0.005(-0.972) | 2.943(+1.097) | 94.329(-4.517) | 5.667
Fairness-enhanced | DAW-FDD [20] | 5.040(+4.917) | 4.993(+4.294) | 80.308(-19.395) | 2.577(+2.454) | 7.253(+6.554) | 78.562(-21.141) | 0.205(+0.082) | 2.708(+2.009) | 93.876(-5.827) | 6.000
Fairness-enhanced | DAG-FDD [20] | 4.279(+4.087) | 13.565(+13.158) | 85.859(-14.012) | 3.885(+3.693) | 7.350(+6.943) | 83.153(-16.718) | 1.062(+0.870) | 1.688(+1.281) | 94.326(-5.545) | 7.167
Fairness-enhanced | PG-FDD [21] | 4.263(+4.129) | 11.077(+10.840) | 81.174(-18.763) | 1.984(+1.850) | 4.715(+4.478) | 84.572(-15.365) | 1.205(+1.071) | 1.159(+0.922) | 94.962(-4.975) | 4.500
Fairness Generalization Evaluation. To evaluate detectors’ fairness generalization capability, we train them on AI-Face and test them on A-DF-1.0, DF-Platter, and GenData, none of which are part of AI-Face. Results on the gender attribute in Table 6 show that: 1) According to Avg-, the top three methods in fairness preservation are PG-FDD, Xception, and CORE. PG-FDD, which is specifically designed for fairness generalization, leads in overall performance. However, it does not excel in terms of performance changes compared with the intra-domain test results from Table 5, indicating room for improvement in its generalization capabilities. 2) CORE is notable for demonstrating negative fairness performance changes on A-DF-1.0 and GenData, suggesting that techniques within CORE could be explored to enhance fairness generalization. More results are in Appendix B.7.
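Each cell of the cross-dataset table pairs a cross-domain value with a signed change in parentheses. As a small worked sketch, assuming the change is defined as the cross-domain value minus the intra-domain value from Table 5, the intra-domain value can be recovered by subtraction. The helper name below is hypothetical.

```python
import re

# Matches cells such as "4.227(+3.956)" or "0.308(-0.669)".
CELL_PATTERN = re.compile(r"^\s*([0-9.]+)\(([+-][0-9.]+)\)\s*$")


def split_cell(cell):
    """Return (cross_domain, change, implied_intra_domain) for one table cell.

    Assumes change = cross_domain - intra_domain, so that
    intra_domain = cross_domain - change (rounded to 3 decimals).
    """
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    cross = float(match.group(1))
    change = float(match.group(2))
    return cross, change, round(cross - change, 3)


# First fairness column for Xception on A-DF-1.0 and on DF-Platter.
xception_adf = split_cell("4.227(+3.956)")
xception_dfp = split_cell("2.308(+2.037)")
# First fairness column for CORE on A-DF-1.0 (a negative change).
core_adf = split_cell("0.308(-0.669)")
```

Under this reading, both Xception cells imply the same intra-domain value, which is consistent with a single Table 5 baseline being used for every test dataset.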
Effect of Increasing Training Set Size. We randomly sample 20%, 40%, 60%, and 80% of each training subset from AI-Face to assess the impact of training size on performance. Key observations from Fig. 7: 1) The performance of UnivFD changes the least and does not improve as the training set grows. 2) Overall, detectors’ performance improves with larger training size, though a few show fluctuations (e.g., ViT-B/16 and CORE). 3) A larger training set may improve utility but not always fairness. For example, Xception and SRM show increased utility when the training size grows from 60% to 80%, but their fairness worsens. Similar trends are observed in DAG-FDD and SPSL when the training set size increases from 40% to 60%. See Appendix B.8 for full results.
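The per-subset sampling used in this experiment can be sketched as below. This is only an illustration under stated assumptions: the subset names, the seed, and the choice to draw each fraction independently (rather than nesting the 20% sample inside the 40% sample, and so on) are hypothetical, since the text does not specify them.

```python
import random


def subsample_per_subset(subsets, fraction, seed=0):
    """Draw the same fraction, without replacement, from every training subset.

    subsets: {subset_name: list of sample identifiers}
    fraction: value in (0, 1], e.g. 0.2 for 20%.
    Sampling each subset separately keeps the mix of generative
    methods unchanged as the total training size varies.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    sampled = {}
    for name, items in subsets.items():
        k = round(len(items) * fraction)
        sampled[name] = rng.sample(items, k)
    return sampled


# Hypothetical subsets standing in for images from different generators.
toy_subsets = {
    "gan_subset": list(range(100)),
    "dm_subset": list(range(1000, 1050)),
}
splits = {f: subsample_per_subset(toy_subsets, f) for f in (0.2, 0.4, 0.6, 0.8)}
```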
Discussion. Based on the above experiments, we summarize the unsolved fairness problems in recent detectors: 1) Detectors’ fairness is unstable across face images generated by different generative methods, indicating that enhancing fairness stability is an important future direction as new generative models continue to emerge. 2) Even though fairness-enhanced detectors exhibit small overall fairness metrics, they still show biased detection towards minority groups. Future studies should be more cautious when designing fair detectors to ensure balanced performance across all demographic groups. 3) There is currently no reliable detector, as all detectors experience severe performance degradation under image post-processing and cross-domain evaluation. Future studies should aim to develop a unified framework that ensures fairness, robustness, and generalization, as these three characteristics are essential for creating a reliable detector.
5 Conclusion
This work presents the first demographically annotated million-scale AI-Face dataset, serving as a pivotal foundation for addressing the urgent need for developing fair AI face detectors. Based on our AI-Face dataset, we conduct the first comprehensive fairness benchmark, shedding light on the fairness performance and challenges of current representative AI face detectors. Our findings can inspire and guide researchers in refining current models and exploring new methods to mitigate bias. Limitation and Future Work: One limitation is that age annotations in our AI-Face dataset have relatively lower accuracy as the age attribute is often too ambiguous to predict. We will improve our annotator’s accuracy in predicting age attributes in the future. Additionally, we plan to extend our fairness benchmark to evaluate large language models like LLaMA2 [78] and GPT4 [79] for detecting AI faces. Social Impact: Malicious users could misuse AI-generated face images from our dataset to create fake social media profiles and spread misinformation. To mitigate this risk, only users who submit a signed end-user license agreement (EULA) will be granted access to our dataset.
Acknowledgment
This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2348419 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of NSF and NAIRR Pilot.
References
- [1] Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, and Shu Hu. Detecting multimedia generated by large ai models: A survey. arXiv preprint arXiv:2402.00045, 2024.
- [2] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019.
- [3] Deepfakes github. https://github.com/deepfakes/faceswap. Accessed: 2024-04-17.
- [4] Fakeapp. https://www.fakeapp.com/. Accessed: 2024-04-17.
- [5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In 7th International Conference on Learning Representations, ICLR 2019, 2019.
- [6] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
- [7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
- [8] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in neural information processing systems, 34:852–863, 2021.
- [9] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [10] Daniel J Tojin T. Eapen. How generative ai can augment human creativity. https://hbr.org/2023/07/how-generative-ai-can-augment-human-creativity, 2023. Accessed: 2024-04-21.
- [11] BBC News. Trump supporters target black voters with faked ai images. https://www.bbc.com/news/world-us-canada-68440150, 2024. Accessed: 2023-05-09.
- [12] Henrik Skaug Sætra. Generative ai: Here to stay, but for good? Technology in Society, 75:102372, 2023.
- [13] Mika Westerlund. The emergence of deepfake technology: A review. Technology innovation management review, 9(11), 2019.
- [14] Wenbo Pu, Jing Hu, Xin Wang, Yuezun Li, Shu Hu, Bin Zhu, Rui Song, Qi Song, Xi Wu, and Siwei Lyu. Learning a deep dual-level network for robust deepfake detection. Pattern Recognition, 130:108832, 2022.
- [15] Hui Guo, Shu Hu, Xin Wang, Ming-Ching Chang, and Siwei Lyu. Robust attentive deep neural network for detecting gan-generated faces. IEEE Access, 10:32574–32583, 2022.
- [16] Zhiyuan Yan, Yong Zhang, Yanbo Fan, and Baoyuan Wu. Ucf: Uncovering common features for generalizable deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22412–22423, 2023.
- [17] Lorenzo Papa, Lorenzo Faiella, Luca Corvitto, Luca Maiano, and Irene Amerini. On the use of stable diffusion for creating realistic faces: from generation to detection. In 2023 11th International Workshop on Biometrics and Forensics (IWBF), pages 1–6. IEEE, 2023.
- [18] Loc Trinh and Yan Liu. An examination of fairness of ai models for deepfake detection. IJCAI, 2021.
- [19] Ying Xu, Philipp Terhörst, Marius Pedersen, and Kiran Raja. Analyzing fairness in deepfake detection with massively annotated databases. IEEE Transactions on Technology and Society, 2024.
- [20] Yan Ju, Shu Hu, Shan Jia, George H Chen, and Siwei Lyu. Improving fairness in deepfake detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4655–4665, 2024.
- [21] Li Lin, Xinan He, Yan Ju, Xin Wang, Feng Ding, and Shu Hu. Preserving fairness generalization in deepfake detection. CVPR, 2024.
- [22] Christopher Teo, Milad Abdollahzadeh, and Ngai-Man Cheung. On measuring fairness in generative models. Advances in Neural Information Processing Systems, 36, 2024.
- [23] Chuqiao Li, Zhiwu Huang, Danda Pani Paudel, Yabin Wang, Mohamad Shahbazi, Xiaopeng Hong, and Luc Van Gool. A continual deepfake detection benchmark: Dataset, methods, and essentials. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1339–1349, 2023.
- [24] Jingyi Deng, Chenhao Lin, Pengbin Hu, Chao Shen, Qian Wang, Qi Li, and Qiming Li. Towards benchmarking and evaluating deepfake detection. IEEE Transactions on Dependable and Secure Computing, 2024.
- [25] Zhiyuan Yan, Yong Zhang, Xinhang Yuan, Siwei Lyu, and Baoyuan Wu. Deepfakebench: A comprehensive benchmark of deepfake detection. In NeurIPS, 2023.
- [26] Binh M Le, Jiwon Kim, Shahroz Tariq, Kristen Moore, Alsharif Abuadbba, and Simon S Woo. Sok: Facial deepfake detectors. arXiv, 2024.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [28] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- [29] Midjourney. https://mid-journey.ai/. Accessed: 2024-04-17.
- [30] Aditya Ramesh et al. Hierarchical text-conditional image generation with clip latents. arXiv, 1(2):3, 2022.
- [31] Donie O’Sullivan. A high school student created a fake 2020 us candidate. twitter verified it. https://cnn.it/3HpHfzz, 2020. Accessed: 2024-04-21.
- [32] Shannon Bond. That smiling linkedin profile face might be a computer-generated fake. https://www.npr.org/2022/03/27/1088140809/fake-linkedin-profiles, 2022. Accessed: 2024-04-21.
- [33] Liming Jiang, Ren Li, Wayne Wu, Chen Qian, and Chen Change Loy. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2889–2898, 2020.
- [34] Kartik Narayan, Harsh Agarwal, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, and Richa Singh. Deephy: On deepfake phylogeny. In 2022 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2022.
- [35] Kartik Narayan, Harsh Agarwal, Kartik Thakral, Surbhi Mittal, Mayank Vatsa, and Richa Singh. Df-platter: multi-face heterogeneous deepfake dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9739–9748, 2023.
- [36] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- [37] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv, 2014.
- [38] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE international conference on computer vision workshops, pages 10–15, 2015.
- [39] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The deepfake detection challenge (dfdc) dataset. arXiv preprint arXiv:2006.07397, 2020.
- [40] Google Research. Contributing data to deepfake detection research, 2019. Accessed: 2024-04-12.
- [41] Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3207–3216, 2020.
- [42] Megvii Technology Limited. Face++ Face Detection. https://www.faceplusplus.com/face-detection/. Accessed: 2024-03.
- [43] InsightFace Project Contributors. InsightFace: State-of-the-Art Face Analysis Toolbox. https://insightface.ai/. Accessed: 2024-03.
- [44] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Open clip. https://github.com/mlfoundations/open_clip, 2021.
- [45] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2020.
- [46] Philipp Terhörst, Marco Huber, Jan Niklas Kolf, Ines Zelch, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Reliable age and gender estimation from face images: Stating the confidence of model predictions. In 2019 IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–8. IEEE, 2019.
- [47] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pages 67–74. IEEE, 2018.
- [48] Philipp Terhörst, Daniel Fährmann, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Maad-face: A massively annotated attribute dataset for face images. IEEE Transactions on Information Forensics and Security, 16:3942–3957, 2021.
- [49] Oliver Giudice, Luca Guarnera, and Sebastiano Battiato. Fighting deepfakes by detecting gan dct anomalies. Journal of Imaging, 7(8):128, 2021.
- [50] Vishal Asnani, Xi Yin, Tal Hassner, and Xiaoming Liu. Reverse engineering of generative models: Inferring model hyperparameters from generated images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [51] David Beniaguev. Synthetic faces high quality (sfhq) dataset, 2022.
- [52] Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking human and model perception of ai-generated images. Advances in Neural Information Processing Systems, 36, 2024.
- [53] L Minh Dang, Syed Ibrahim Hassan, Suhyeon Im, Jaecheol Lee, Sujin Lee, and Hyeonjoon Moon. Deep learning based computer generated face identification using convolutional neural network. Applied Sciences, 8(12):2610, 2018.
- [54] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
- [55] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. arXiv preprint arXiv:2303.09295, 2023.
- [56] Minchul Kim, Feng Liu, Anil Jain, and Xiaoming Liu. Dcface: Synthetic face generation with dual condition diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12715–12725, 2023.
- [57] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
- [58] Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, and Shaikh Anowarul Fattah. Artifact: A large-scale dataset with artificial and factual images for generalizable and robust synthetic image detection. arXiv e-prints, pages arXiv–2302, 2023.
- [59] Haixu Song, Shiyu Huang, Yinpeng Dong, and Wei-Wei Tu. Robustness and generalizability of deepfake detection: A study with diffusion models. arXiv preprint arXiv:2309.02218, 2023.
- [60] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020.
- [61] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
- [62] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
- [63] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, 2021.
- [64] Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision, pages 86–103. Springer, 2020.
- [65] Honggu Liu, Xiaodan Li, Wenbo Zhou, Yuefeng Chen, Yuan He, Hui Xue, Weiming Zhang, and Nenghai Yu. Spatial-phase shallow learning: rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 772–781, 2021.
- [66] Yuchen Luo, Yong Zhang, Junchi Yan, and Wei Liu. Generalizing face forgery detection with high-frequency features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16326, 2021.
- [67] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480–24489, 2023.
- [68] Yunsheng Ni, Depu Meng, Changqian Yu, Chengbin Quan, Dongchun Ren, and Youjian Zhao. Core: Consistent representation learning for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12–21, 2022.
- [69] Xiaotian Han, Jianfeng Chi, Yu Chen, Qifan Wang, Han Zhao, Na Zou, and Xia Hu. Ffb: A fair fairness benchmark for in-processing group fairness methods. In ICLR, 2024.
- [70] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6):1–35, 2021.
- [71] Jialu Wang, Xin Eric Wang, and Yang Liu. Understanding instance-level impact of fairness constraints. In International Conference on Machine Learning, pages 23114–23130. PMLR, 2022.
- [72] Hao Wang, Luxi He, Rui Gao, and Flavio P Calmon. Aleatoric and epistemic discrimination in classification. ICML, 2023.
- [73] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
- [74] Shu Hu and George H Chen. Fairness in survival analysis with distributionally robust optimization. arXiv, 2023.
- [75] Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, Wangmeng Zuo, and Shilei Wen. Stgan: A unified selective transfer network for arbitrary image attribute editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3673–3682, 2019.
- [76] Federico Cocchi, Lorenzo Baraldi, Samuele Poppi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Unveiling the impact of image transformations on deepfake detection: An experimental analysis. In International Conference on Image Analysis and Processing, pages 345–356. Springer, 2023.
- [77] Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. arXiv preprint arXiv:2312.00195, 2023.
- [78] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [79] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [80] Ying Xu et al. A comprehensive analysis of ai biases in deepfake detection with massively annotated databases. arXiv, 2022.
- [81] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE Transactions on information forensics and security, 9(12):2170–2179, 2014.
- [82] Robert Williamson and Aditya Menon. Fairness risk measures. In International conference on machine learning, pages 6786–6797. PMLR, 2019.
- [83] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33:8847–8860, 2020.
- [84] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
- [85] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929–1938. PMLR, 2018.
- [86] John C Duchi and Hongseok Namkoong. Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
- [87] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
Appendix
Appendix A The Details of Demographically Annotated AI-Face Dataset
A.1 Phase1: Annotator Development
A.1.1 Annotator Implementation Details
For developing the annotator, all experiments are run in PyTorch on a single NVIDIA RTX A6000 GPU. For training, we fix the batch size at 64 and the number of epochs at 32, and use the Adam optimizer with an initial learning rate . Additionally, we employ a Cosine Annealing Learning Rate Scheduler to modulate the learning rate adaptively across the training duration. The hyperparameter in SAM optimization is set as 0.05. For uncertainty estimation, and in uncertainty score are set as 100 and 0.2, respectively.
A.1.2 Details of Threshold Settings for Sample Difficulty Level
For Q2, Setting: According to the distribution shown in Appendix A.2.2, for the VGGFace2 [47], A-DFDC [80], and A-DFD [80] test sets, the threshold and are set as 0.25 and 0.4, respectively. And and are set as 0.3 and 0.5, respectively. The threshold for the gender attribute is stricter than that for age because gender prediction is a relatively easier task than age prediction, which is also reflected in the distribution. For A-FF++ [80] and A-Celeb-DF-v2 [80], we adjust the threshold to 0.21 and to 0.25 in order to obtain a sufficient 1,500 images in each sample difficulty level subset, especially for the ‘Hard’ level.
A.1.3 Additional Annotator Evaluation Results
Tables 7 to 11 compare our annotator against the baselines InsightFace [43] and Face++ [42] on detailed attributes. The findings align with the results in Table 3 of the submitted manuscript. For cross-domain evaluation, we additionally choose the Adience [81] dataset, whose images are manually annotated and which consists of over 26.5k real images of over 2.2k different individuals in unconstrained environments, to further validate the effectiveness and generalization capability of our annotator. Results in Table 12 demonstrate that our annotator again outperforms InsightFace [43] and Face++ [42]. Overall, one intra-domain dataset (VGGFace2) and five cross-domain datasets (A-FF++, A-DFDC, A-DFD, A-Celeb-DF-v2, and Adience) all validate our annotator’s superior performance over the current state-of-the-art face attribute prediction tools Face++ [42] and InsightFace [43].
Dataset: VGGFace2 [47]
Level | Method | Female precision | Female recall | Female F1 | Male precision | Male recall | Male F1 | Young precision | Young recall | Young F1 | Middle_Aged precision | Middle_Aged recall | Middle_Aged F1 | Senior precision | Senior recall | Senior F1
All | Face++ [42] | 77.042 (0.363) | 83.597 (0.789) | 80.060 (0.444) | 79.571 (0.780) | 72.314 (0.516) | 75.584 (0.453) | 80.201 (2.448) | 34.649 (1.182) | 46.991 (1.278) | 46.049 (0.832) | 63.692 (1.588) | 53.327 (0.976) | 66.266 (1.030) | 77.261 (1.188) | 71.320 (0.955)
All | InsightFace [43] | 76.560 (0.533) | 78.062 (0.737) | 77.281 (0.534) | 76.946 (0.629) | 75.395 (0.555) | 76.139 (0.474) | 78.730 (2.135) | 30.533 (0.961) | 42.907 (1.052) | 41.660 (1.005) | 56.653 (1.429) | 47.936 (1.078) | 61.426 (0.649) | 76.107 (1.212) | 67.966 (0.724)
All | Ours | 81.452 (0.413) | 89.467 (0.331) | 85.158 (0.323) | 87.267 (0.393) | 78.329 (0.572) | 82.401 (0.438) | 95.619 (0.565) | 80.467 (1.343) | 86.659 (0.629) | 91.189 (0.812) | 75.027 (0.880) | 81.665 (0.585) | 88.043 (0.982) | 76.720 (1.388) | 81.747 (0.837)
Easy | Face++ [42] | 97.482 (0.604) | 96.742 (0.549) | 97.108 (0.329) | 96.697 (0.545) | 97.439 (0.642) | 97.064 (0.355) | 89.360 (0.670) | 57.507 (1.825) | 69.964 (1.403) | 58.673 (0.837) | 72.205 (1.397) | 64.729 (0.699) | 79.536 (1.119) | 89.546 (0.600) | 84.240 (0.687)
Easy | InsightFace [43] | 97.625 (0.403) | 96.373 (0.229) | 96.994 (0.124) | 96.420 (0.206) | 97.653 (0.410) | 97.032 (0.135) | 88.544 (0.917) | 50.080 (1.831) | 63.957 (1.585) | 53.146 (0.621) | 65.720 (1.001) | 58.762 (0.497) | 73.657 (0.808) | 88.200 (0.748) | 80.271 (0.555)
Easy | Ours | 99.575 (0.176) | 99.893 (0.100) | 99.734 (0.126) | 99.893 (0.100) | 99.573 (0.177) | 99.733 (0.127) | 99.720 (0.098) | 99.760 (0.080) | 99.740 (0.049) | 99.279 (0.267) | 99.000 (0.400) | 99.139 (0.164) | 99.551 (0.431) | 96.480 (0.688) | 97.988 (0.204)
Medium | Face++ [42] | 72.977 (0.292) | 81.336 (1.246) | 76.927 (0.679) | 78.435 (1.101) | 69.245 (0.681) | 73.549 (0.618) | 76.257 (2.732) | 24.427 (0.709) | 36.996 (1.057) | 41.197 (0.961) | 61.838 (2.458) | 49.446 (1.472) | 62.616 (1.171) | 73.719 (0.503) | 67.710 (0.759)
Medium | InsightFace [43] | 73.710 (0.699) | 75.360 (1.316) | 74.521 (0.907) | 74.807 (1.070) | 73.120 (0.766) | 73.950 (0.753) | 74.222 (2.488) | 22.080 (0.588) | 34.033 (0.928) | 37.372 (1.208) | 53.480 (2.160) | 43.995 (1.535) | 58.070 (0.282) | 73.840 (1.216) | 65.008 (0.497)
Medium | Ours | 82.518 (0.621) | 95.253 (0.482) | 88.428 (0.439) | 94.389 (0.544) | 79.813 (0.858) | 86.489 (0.583) | 96.648 (0.627) | 84.200 (1.730) | 89.987 (0.111) | 95.382 (0.604) | 75.920 (1.017) | 84.540 (0.650) | 87.960 (0.915) | 76.800 (1.544) | 81.993 (1.018)
Hard | Face++ [42] | 60.667 (0.194) | 72.714 (0.571) | 66.146 (0.323) | 63.582 (0.694) | 50.259 (0.226) | 56.140 (0.385) | 74.987 (3.942) | 22.012 (1.012) | 34.013 (1.374) | 38.276 (0.697) | 57.032 (0.910) | 45.807 (0.758) | 56.647 (0.799) | 68.517 (2.462) | 62.009 (1.420)
Hard | InsightFace [43] | 58.345 (0.498) | 62.453 (0.667) | 60.329 (0.570) | 59.611 (0.610) | 55.413 (0.489) | 57.435 (0.533) | 73.425 (2.999) | 19.440 (0.463) | 30.730 (0.642) | 34.462 (1.187) | 50.760 (1.127) | 41.050 (1.202) | 52.550 (0.858) | 66.280 (1.671) | 58.618 (1.121)
Hard | Ours | 62.263 (0.443) | 73.255 (0.410) | 67.312 (0.403) | 67.518 (0.534) | 55.600 (0.680) | 60.982 (0.604) | 90.490 (0.971) | 57.440 (2.218) | 70.249 (1.726) | 78.905 (1.564) | 50.160 (1.222) | 61.315 (0.941) | 76.617 (1.600) | 56.880 (1.933) | 65.260 (1.288)
Level | Method | A-FF++ [80] | ||||||||||||||
Female | Male | Young | Middle_Aged | Senior | ||||||||||||
precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | ||
90.106 | 88.345 | 89.129 | 88.560 | 90.007 | 89.201 | 74.707 | 72.839 | 73.547 | 52.934 | 81.935 | 63.981 | 84.769 | 37.933 | 51.193 | ||
Face++ [42] | (0.282) | (0.400) | (0.220) | (0.323) | (0.295) | (0.188) | (1.264) | (0.581) | (0.656) | (0.790) | (1.288) | (0.852) | (0.922) | (1.705) | (1.884) | |
87.918 | 81.284 | 84.344 | 82.787 | 88.662 | 85.533 | 67.148 | 83.284 | 74.317 | 48.659 | 74.220 | 58.725 | 93.455 | 20.959 | 33.854 | ||
InsightFace [43] | (0.458) | (0.861) | (0.560) | (0.625) | (0.472) | (0.399) | (1.429) | (0.427) | (0.959) | (0.658) | (1.532) | (0.831) | (1.228) | (1.708) | (2.315) | |
89.851 | 94.149 | 91.866 | 93.177 | 88.451 | 90.641 | 90.888 | 87.477 | 88.689 | 88.461 | 71.439 | 78.337 | 93.390 | 66.519 | 77.023 | ||
All | Ours | (0.290) | (0.302) | (0.202) | (0.280) | (0.384) | (0.226) | (0.632) | (0.280) | (0.396) | (1.556) | (1.064) | (1.012) | (0.686) | (1.461) | (1.138) |
98.157 | 97.947 | 98.052 | 97.950 | 98.159 | 98.054 | 85.686 | 93.398 | 89.372 | 69.170 | 89.520 | 78.036 | 95.646 | 58.880 | 72.888 | ||
Face++ [42] | (0.302) | (0.136) | (0.146) | (0.131) | (0.308) | (0.151) | (0.625) | (1.072) | (0.676) | (0.878) | (0.431) | (0.651) | (0.371) | (0.546) | (0.427) | |
97.004 | 96.640 | 96.820 | 96.656 | 97.013 | 96.833 | 76.172 | 98.120 | 85.762 | 61.246 | 86.840 | 71.828 | 98.106 | 28.800 | 44.524 | ||
InsightFace [43] | (0.470) | (0.605) | (0.387) | (0.579) | (0.482) | (0.380) | (1.303) | (0.665) | (1.016) | (0.576) | (1.039) | (0.519) | (0.755) | (0.633) | (0.718) | |
97.987 | 99.920 | 98.944 | 99.919 | 97.947 | 98.923 | 91.450 | 100.000 | 95.530 | 99.494 | 94.320 | 96.838 | 95.154 | 85.022 | 89.802 | ||
Easy | Ours | (0.222) | (0.065) | (0.092) | (0.067) | (0.233) | (0.097) | (0.862) | (0.000) | (0.470) | (0.170) | (0.483) | (0.286) | (0.500) | (0.724) | (0.478) |
98.839 | 89.590 | 93.985 | 90.605 | 98.960 | 94.597 | 84.212 | 80.092 | 82.098 | 48.990 | 86.710 | 62.604 | 91.594 | 25.400 | 39.746 | ||
Face++ [42] | (0.312) | (0.691) | (0.294) | (0.553) | (0.285) | (0.230) | (0.742) | (0.379) | (0.333) | (0.591) | (0.585) | (0.520) | (0.670) | (1.688) | (2.102) | |
95.655 | 79.813 | 87.017 | 82.685 | 96.373 | 89.005 | 69.090 | 85.800 | 76.542 | 45.982 | 73.680 | 56.622 | 96.654 | 15.040 | 26.022 | ||
InsightFace [43] | (0.499) | (0.798) | (0.546) | (0.573) | (0.433) | (0.408) | (0.862) | (0.358) | (0.579) | (0.273) | (0.985) | (0.435) | (0.636) | (0.794) | (1.194) | |
98.907 | 98.827 | 98.866 | 98.828 | 98.907 | 98.867 | 96.968 | 97.748 | 97.356 | 95.636 | 79.488 | 86.812 | 96.102 | 63.732 | 76.632 | ||
Medium | Ours | (0.227) | (0.131) | (0.103) | (0.128) | (0.229) | (0.104) | (0.311) | (0.230) | (0.158) | (0.900) | (0.673) | (0.415) | (0.399) | (1.483) | (1.134) |
73.323 | 77.498 | 75.352 | 77.123 | 72.901 | 74.952 | 54.224 | 45.026 | 49.172 | 40.642 | 69.576 | 51.302 | 67.066 | 29.518 | 40.946 | ||
Face++ [42] | (0.231) | (0.374) | (0.220) | (0.284) | (0.292) | (0.184) | (2.425) | (0.291) | (0.960) | (0.901) | (2.847) | (1.385) | (1.727) | (2.881) | (3.122) | |
71.096 | 67.400 | 69.194 | 69.020 | 72.600 | 70.762 | 56.182 | 65.932 | 60.648 | 38.748 | 62.140 | 47.724 | 85.604 | 19.036 | 31.016 | ||
72.658 | 83.700 | 77.787 | 80.784 | 68.500 | 74.133 | 84.246 | 64.684 | 73.180 | 70.252 | 40.508 | 51.360 | 88.914 | 50.804 | 64.634 | ||
Hard | Ours | (0.419) | (0.710) | (0.411) | (0.645) | (0.691) | (0.477) | (0.722) | (0.610) | (0.561) | (3.600) | (2.035) | (2.334) | (1.160) | (2.175) | (1.803) |
Level | Method | A-DFDC [19] | ||||||||||||||
Female | Male | Young | Middle_Aged | Senior | ||||||||||||
precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | ||
74.751 | 83.324 | 78.453 | 78.824 | 68.797 | 72.922 | 74.499 | 81.417 | 77.479 | 58.696 | 74.613 | 65.608 | 85.533 | 52.741 | 63.542 | ||
Face++ [42] | (0.675) | (0.948) | (0.603) | (0.933) | (0.929) | (0.732) | (0.567) | (1.186) | (0.772) | (0.688) | (0.420) | (0.496) | (0.792) | (1.559) | (1.469) | |
72.691 | 64.662 | 68.376 | 68.160 | 75.560 | 71.623 | 71.209 | 86.658 | 77.985 | 58.102 | 76.777 | 66.086 | 83.061 | 36.659 | 49.620 | ||
InsightFace [43] | (0.649) | (1.208) | (0.671) | (0.779) | (0.754) | (0.453) | (0.852) | (0.703) | (0.676) | (1.040) | (0.370) | (0.797) | (0.989) | (1.802) | (1.833) | |
74.805 | 88.720 | 80.839 | 85.137 | 67.864 | 74.834 | 88.669 | 76.413 | 81.695 | 91.753 | 77.845 | 83.772 | 90.797 | 77.147 | 82.823 | ||
All | Ours | (0.588) | (0.587) | (0.498) | (0.622) | (0.848) | (0.619) | (0.633) | (1.103) | (0.752) | (0.736) | (0.670) | (0.545) | (0.618) | (1.356) | (1.116) |
94.124 | 89.917 | 91.965 | 90.351 | 94.367 | 92.309 | 86.586 | 96.596 | 91.316 | 73.340 | 85.560 | 78.974 | 99.440 | 71.324 | 83.066 | ||
Face++ [42] | (0.977) | (1.010) | (0.564) | (0.835) | (1.025) | (0.535) | (0.691) | (0.713) | (0.489) | (0.853) | (0.933) | (0.622) | (0.255) | (1.013) | (0.768) | |
91.874 | 78.693 | 84.753 | 81.389 | 93.013 | 86.801 | 73.580 | 95.760 | 83.216 | 62.690 | 77.960 | 69.496 | 94.034 | 42.760 | 58.776 | ||
InsightFace [43] | (0.986) | (1.798) | (0.744) | (1.140) | (1.059) | (0.384) | (0.821) | (0.196) | (0.552) | (0.659) | (0.898) | (0.746) | (1.262) | (1.209) | (1.189) | |
95.289 | 97.413 | 96.337 | 97.353 | 95.173 | 96.248 | 97.448 | 99.240 | 98.334 | 94.902 | 96.996 | 95.938 | 99.730 | 89.844 | 94.528 | ||
Easy | Ours | (1.019) | (0.136) | (0.566) | (0.157) | (1.098) | (0.621) | (0.635) | (0.233) | (0.359) | (0.030) | (0.540) | (0.280) | (0.168) | (0.433) | (0.303) |
70.438 | 81.415 | 75.529 | 77.764 | 65.533 | 71.124 | 70.980 | 64.526 | 67.584 | 53.830 | 69.930 | 60.828 | 73.342 | 58.332 | 64.972 | ||
Face++ [42] | (0.334) | (0.881) | (0.528) | (0.836) | (0.418) | (0.457) | (0.703) | (2.147) | (1.379) | (0.578) | (0.000) | (0.372) | (0.815) | (1.562) | (1.158) | |
69.126 | 66.733 | 67.905 | 67.859 | 70.200 | 69.006 | 73.794 | 77.866 | 75.768 | 54.900 | 71.330 | 62.038 | 68.248 | 44.000 | 53.474 | ||
InsightFace [43] | (0.316) | (1.285) | (0.813) | (0.788) | (0.221) | (0.345) | (1.105) | (1.205) | (0.966) | (1.229) | (0.000) | (0.780) | (0.711) | (2.241) | (1.769) | |
69.072 | 95.067 | 80.011 | 92.097 | 57.433 | 70.746 | 89.568 | 67.934 | 77.250 | 94.220 | 69.246 | 79.824 | 82.306 | 82.014 | 82.158 | ||
Medium | Ours | (0.350) | (0.680) | (0.444) | (1.027) | (0.512) | (0.593) | (0.325) | (2.048) | (1.451) | (1.257) | (0.431) | (0.524) | (1.062) | (0.908) | (0.891) |
59.692 | 78.639 | 67.866 | 68.357 | 46.489 | 55.334 | 65.930 | 83.128 | 73.536 | 48.918 | 68.350 | 57.022 | 83.816 | 28.568 | 42.588 | ||
Face++ [42] | (0.715) | (0.953) | (0.716) | (1.128) | (1.344) | (1.203) | (0.307) | (0.698) | (0.449) | (0.635) | (0.326) | (0.495) | (1.307) | (2.102) | (2.481) | |
57.073 | 48.560 | 52.471 | 55.231 | 63.467 | 59.062 | 66.252 | 86.348 | 74.972 | 56.716 | 81.042 | 66.724 | 86.900 | 23.216 | 36.610 | ||
InsightFace [43] | (0.647) | (0.543) | (0.456) | (0.409) | (0.983) | (0.629) | (0.631) | (0.707) | (0.511) | (1.233) | (0.211) | (0.864) | (0.995) | (1.955) | (2.540) | |
60.054 | 73.680 | 66.170 | 65.961 | 50.987 | 57.509 | 78.990 | 62.066 | 69.500 | 86.136 | 67.294 | 75.554 | 90.354 | 59.582 | 71.784 | ||
Hard | Ours | (0.394) | (0.945) | (0.485) | (0.683) | (0.934) | (0.645) | (0.940) | (1.029) | (0.447) | (0.922) | (1.038) | (0.832) | (0.624) | (2.726) | (2.154) |
Level | Method | A-DFD [19] | |||||
Female | Male | ||||||
precision | recall | F1 | precision | recall | F1 | ||
74.375 | 62.743 | 67.925 | 68.258 | 78.854 | 73.096 | ||
Face++ [42] | (1.442) | (1.445) | (1.256) | (1.197) | (1.544) | (1.228) | |
71.967 | 51.796 | 59.975 | 63.600 | 81.636 | 71.405 | ||
InsightFace [43] | (1.062) | (1.161) | (1.053) | (0.651) | (0.863) | (0.660) | |
72.884 | 83.547 | 77.548 | 78.938 | 66.418 | 71.615 | ||
All | Ours | (0.580) | (0.753) | (0.538) | (0.750) | (0.979) | (0.783) |
95.014 | 82.284 | 88.179 | 84.398 | 95.677 | 89.676 | ||
Face++ [42] | (0.377) | (1.876) | (1.011) | (1.378) | (0.391) | (0.686) | |
94.405 | 75.573 | 83.944 | 79.639 | 95.520 | 86.858 | ||
InsightFace [43] | (0.777) | (0.956) | (0.794) | (0.683) | (0.634) | (0.596) | |
94.434 | 93.600 | 94.013 | 93.659 | 94.480 | 94.066 | ||
Easy | Ours | (0.418) | (0.566) | (0.307) | (0.517) | (0.451) | (0.295) |
65.536 | 60.151 | 62.715 | 63.114 | 68.283 | 65.587 | ||
Face++ [42] | (1.779) | (1.249) | (1.202) | (1.077) | (2.356) | (1.566) | |
64.397 | 47.147 | 54.434 | 58.323 | 73.947 | 65.211 | ||
InsightFace [43] | (1.071) | (1.433) | (1.310) | (0.769) | (0.646) | (0.671) | |
65.158 | 86.133 | 74.188 | 79.536 | 53.920 | 64.258 | ||
Medium | Ours | (0.886) | (0.625) | (0.666) | (0.911) | (1.776) | (1.472) |
62.576 | 45.793 | 52.882 | 57.261 | 72.603 | 64.024 | ||
Face++ [42] | (2.170) | (1.210) | (1.556) | (1.137) | (1.885) | (1.433) | |
57.100 | 32.667 | 41.547 | 52.838 | 75.440 | 62.145 | ||
InsightFace [43] | (1.339) | (1.096) | (1.054) | (0.501) | (1.310) | (0.714) | |
59.062 | 70.907 | 64.442 | 63.618 | 50.853 | 56.520 | ||
Hard | Ours | (0.437) | (1.068) | (0.641) | (0.820) | (0.709) | (0.582) |
A-Celeb-DF-v2 [19] | |||||||
Female | Male | ||||||
Level | Method | precision | recall | F1 | precision | recall | F1 |
97.624 | 83.302 | 89.455 | 86.517 | 98.318 | 91.819 | ||
Face++ [42] | (0.815) | (0.727) | (0.438) | (0.494) | (0.640) | (0.453) | |
97.442 | 85.842 | 91.041 | 88.079 | 98.007 | 92.635 | ||
InsightFace [43] | (0.537) | (0.569) | (0.317) | (0.388) | (0.464) | (0.299) | |
96.381 | 93.611 | 94.921 | 94.147 | 96.687 | 95.355 | ||
All | Ours | (0.756) | (0.564) | (0.400) | (0.422) | (0.782) | (0.426) |
99.889 | 96.480 | 98.155 | 96.598 | 99.893 | 98.218 | ||
Face++ [42] | (0.055) | (0.418) | (0.236) | (0.392) | (0.053) | (0.222) | |
99.837 | 98.107 | 98.964 | 98.140 | 99.840 | 98.983 | ||
InsightFace [43] | (0.054) | (0.352) | (0.180) | (0.340) | (0.053) | (0.174) | |
100.000 | 99.973 | 99.987 | 99.973 | 100.000 | 99.987 | ||
Easy | Ours | (0.000) | (0.053) | (0.027) | (0.053) | (0.000) | (0.027) |
99.732 | 89.227 | 94.185 | 90.260 | 99.760 | 94.771 | ||
Face++ [42] | (0.199) | (0.952) | (0.558) | (0.787) | (0.177) | (0.460) | |
99.639 | 88.320 | 93.638 | 89.513 | 99.680 | 94.323 | ||
InsightFace [43] | (0.226) | (0.496) | (0.354) | (0.412) | (0.200) | (0.298) | |
99.760 | 99.760 | 99.760 | 99.760 | 99.760 | 99.760 | ||
Medium | Ours | (0.053) | (0.177) | (0.100) | (0.176) | (0.053) | (0.100) |
93.251 | 64.200 | 76.024 | 72.694 | 95.300 | 82.469 | ||
Face++ [42] | (2.190) | (0.812) | (0.519) | (0.304) | (1.691) | (0.679) | |
92.850 | 71.100 | 80.521 | 76.584 | 94.500 | 84.600 | ||
InsightFace [43] | (1.331) | (0.860) | (0.417) | (0.412) | (1.140) | (0.424) | |
89.384 | 81.100 | 85.015 | 82.707 | 90.300 | 86.319 | ||
Hard | Ours | (2.214) | (1.463) | (1.073) | (1.037) | (2.294) | (1.150) |
Level | Method | Adience [81] | ||||||||||||||
Female | Male | Young | Middle_Aged | Senior | ||||||||||||
precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | precision | recall | F1 | ||
71.800 | 87.124 | 76.457 | 84.701 | 71.257 | 76.322 | 89.823 | 68.558 | 77.632 | 51.442 | 70.678 | 58.668 | 50.280 | 76.456 | 60.540 | ||
Face++ [42] | (0.900) | (0.743) | (0.797) | (0.732) | (0.863) | (0.601) | (0.444) | (1.024) | (0.696) | (1.569) | (1.397) | (1.313) | (3.051) | (3.228) | (2.199) | |
67.003 | 74.180 | 68.487 | 73.848 | 67.877 | 69.512 | 86.510 | 37.559 | 51.425 | 37.028 | 65.519 | 45.545 | 26.775 | 72.412 | 38.946 | ||
InsightFace [43] | (0.687) | (1.140) | (0.650) | (1.068) | (0.722) | (0.583) | (1.147) | (1.079) | (1.225) | (0.888) | (1.304) | (0.884) | (1.672) | (2.321) | (2.045) | |
78.103 | 94.416 | 83.852 | 96.915 | 82.496 | 88.602 | 77.517 | 81.767 | 79.578 | 48.206 | 37.327 | 41.843 | 67.352 | 69.370 | 68.241 | ||
All | Ours | (0.732) | (0.338) | (0.603) | (0.382) | (0.607) | (0.443) | (0.500) | (0.963) | (0.579) | (1.794) | (0.529) | (1.061) | (3.613) | (3.679) | (3.258) |
97.977 | 93.613 | 95.658 | 92.325 | 97.319 | 94.754 | 96.754 | 81.939 | 88.730 | 33.469 | 68.722 | 44.960 | 59.861 | 88.473 | 71.165 | ||
Face++ [42] | (0.251) | (0.646) | (0.273) | (0.660) | (0.344) | (0.284) | (0.253) | (0.746) | (0.434) | (2.274) | (1.985) | (2.095) | (4.706) | (4.331) | (2.485) | |
96.613 | 89.815 | 93.087 | 88.180 | 96.009 | 91.925 | 97.076 | 58.063 | 72.662 | 19.981 | 70.808 | 31.150 | 26.572 | 84.972 | 40.411 | ||
InsightFace [43] | (0.210) | (1.000) | (0.537) | (0.980) | (0.331) | (0.540) | (0.455) | (0.819) | (0.717) | (0.942) | (2.276) | (1.185) | (2.943) | (2.722) | (3.521) | |
99.822 | 99.974 | 99.898 | 99.969 | 99.776 | 99.872 | 93.651 | 97.159 | 95.372 | 57.920 | 37.414 | 45.375 | 97.721 | 96.831 | 97.221 | ||
Easy | Ours | (0.104) | (0.052) | (0.078) | (0.063) | (0.124) | (0.094) | (0.599) | (0.507) | (0.455) | (2.428) | (0.217) | (1.469) | (3.063) | (2.986) | (2.037) |
82.280 | 87.626 | 84.863 | 72.271 | 63.091 | 67.350 | 90.557 | 64.622 | 75.413 | 57.229 | 71.100 | 63.397 | 47.362 | 75.972 | 58.295 | ||
Face++ [42] | (1.064) | (0.710) | (0.617) | (1.249) | (1.654) | (0.996) | (0.379) | (1.444) | (1.015) | (1.735) | (1.580) | (1.350) | (2.070) | (3.031) | (1.800) | |
78.900 | 75.366 | 77.087 | 55.589 | 60.493 | 57.923 | 83.417 | 30.372 | 44.510 | 40.618 | 61.066 | 48.776 | 26.303 | 65.071 | 37.453 | ||
InsightFace [43] | (1.250) | (0.827) | (0.818) | (0.852) | (1.375) | (0.637) | (2.029) | (2.139) | (2.587) | (0.934) | (0.713) | (0.618) | (1.300) | (2.274) | (1.616) | |
92.671 | 99.184 | 95.816 | 98.159 | 84.639 | 90.897 | 76.159 | 80.899 | 78.455 | 40.740 | 34.599 | 37.409 | 69.317 | 69.799 | 69.532 | ||
Medium | Ours | (0.607) | (0.280) | (0.385) | (0.599) | (0.735) | (0.505) | (0.372) | (0.913) | (0.492) | (1.307) | (0.808) | (0.812) | (5.848) | (5.737) | (5.645) |
35.144 | 80.134 | 48.851 | 89.507 | 53.361 | 66.860 | 82.159 | 59.114 | 68.752 | 63.628 | 72.212 | 67.646 | 43.617 | 64.923 | 52.162 | ||
Face++ [42] | (1.384) | (0.874) | (1.502) | (0.287) | (0.591) | (0.522) | (0.701) | (0.882) | (0.638) | (0.697) | (0.626) | (0.495) | (2.376) | (2.322) | (2.311) | |
25.495 | 57.358 | 35.287 | 77.776 | 47.129 | 58.688 | 79.037 | 24.242 | 37.102 | 50.485 | 64.681 | 56.708 | 27.452 | 67.194 | 38.973 | ||
InsightFace [43] | (0.602) | (1.594) | (0.596) | (1.373) | (0.460) | (0.572) | (0.956) | (0.280) | (0.372) | (0.789) | (0.923) | (0.848) | (0.773) | (1.967) | (0.999) | |
41.818 | 84.089 | 55.842 | 92.619 | 63.072 | 75.038 | 62.742 | 67.243 | 64.907 | 45.958 | 39.968 | 42.745 | 35.018 | 41.479 | 37.972 | ||
Hard | Ours | (1.484) | (0.684) | (1.345) | (0.486) | (0.962) | (0.730) | (0.530) | (1.468) | (0.790) | (1.648) | (0.563) | (0.902) | (1.929) | (2.314) | (2.092) |
A.2 Phase 2: Demographic Annotation Generation
A.2.1 Detailed Information of Datasets
Methods | #Samples | FFHQ [6] | CASIA-WebFace [37] | IMDB-WIKI [38] | CelebA [36] | A-FF++ (Real) [80] | A-DFDC (Real) [80] | A-DFD (Real) [80] | A-Celeb-DF-v2 (Real) [80] |
A-FF++ [2] | 105K | ✓ | |||||||
A-DFDC [39] | 37K | ✓ | |||||||
A-DFD [40] | 31K | ✓ | |||||||
A-Celeb-DF-v2 [41] | 155K | ✓ | |||||||
AttGAN [49] | 6K | ✓ | |||||||
MMDGAN [50] | 1K | ✓ | |||||||
StarGAN [49] | 5.6K | ✓ | |||||||
StyleGAN [49] | 10K | ✓ | |||||||
StyleGAN2 [51] | 118K | ✓ | |||||||
StyleGAN3 [52] | 26.7K | ✓ | |||||||
MSG-StyleGAN [50] | 1K | ✓ | |||||||
ProGAN [53] | 100K | ✓ | |||||||
STGAN [50] | 1K | ✓ | |||||||
VQGAN [54] | 50K | ✓ | |||||||
DALLE2 [55] | 204 | ✓ | |||||||
IF [55] | 505 | ✓ | |||||||
Midjourney [55] | 100 | ✓ | |||||||
DCFace [56] | 529K | ✓ | |||||||
Latent Diffusion [57] | 20K | ✓ | |||||||
Palette [58] | 6K | ✓ | |||||||
SD v1.5 [59] | 18K | ✓ | |||||||
SD Inpainting [59] | 20.9K | ✓ | |||||||
Total | 1,245,660 | 70,000 | 474,876 | 26,788 | 202,502 | 21,593 | 37,836 | 8,856 | 23,645 |
866,096 |
Table 13 details all subsets collected and incorporated into our AI-Face dataset. It covers fake facial images from deepfake videos as well as images generated by GANs and DMs. The corresponding real sources of most AI-generated face subsets are FFHQ [6] and CelebA [36]. In total, our AI-Face dataset contains 30 subsets (22 fake subsets and 8 real subsets) and 37 generation methods (5 in A-FF++, 5 in A-DFD, 8 in A-DFDC, 1 in A-Celeb-DF-v2, 10 GANs, and 8 DMs), comprising 1,245,660 fake face images and 866,096 real face images. Fig. A.1 visualizes face images from each subset. Fig. A.2 shows the detailed demographic distribution of our AI-Face dataset. The dataset is relatively gender-balanced, and the subjects are predominantly young and white individuals.
A.2.2 Details of Threshold Settings for Human Correction
In this section, we present the uncertainty score distributions of each attribute (i.e., Gender, Age, and Race) for each subset in our AI-Face dataset, as shown in Fig. A.4 through Fig. A.31. Overall, these distributions show that our annotator is more confident in predicting gender than in predicting age. Because different subsets show different distributions, we dynamically adjust the threshold for each attribute on each subset, as defined in ‘Human Correction’ in Section 3.2. First, we fit the distribution with a gamma distribution and calculate its mean and standard deviation. Then, the threshold is calculated from these statistics. Given the threshold, we obtain the number of images within each subset that need human correction. Assuming it takes three seconds for a human to correct one annotation for one image, we can calculate the total time needed for a human to correct the images beyond the threshold. The threshold is therefore dynamically adjusted based on both the distribution and the total time needed for human correction.
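The thresholding procedure above can be illustrated with a minimal sketch. Several choices here are assumptions rather than details taken from the paper: the gamma distribution is fitted by the method of moments, the threshold is taken as the mean plus `k` standard deviations, and the function names and the candidate values of `k` are hypothetical. Only the three-seconds-per-image cost comes from the text.

```python
import math
from statistics import mean, pstdev


def gamma_moments_fit(scores):
    """Fit a gamma distribution by the method of moments (an assumed fitting choice).

    Returns (shape, scale, mean, std) of the fitted distribution.
    """
    m = mean(scores)
    s = pstdev(scores)
    if m <= 0 or s == 0:
        raise ValueError("scores must be positive and not all identical")
    shape = (m / s) ** 2
    scale = s * s / m
    return shape, scale, shape * scale, math.sqrt(shape) * scale


def correction_budget(scores, k, seconds_per_image=3.0):
    """Return (threshold, number of flagged images, review time in seconds)."""
    _, _, mu, sigma = gamma_moments_fit(scores)
    threshold = mu + k * sigma  # ASSUMED rule; the paper's exact formula is not reproduced here
    n_flagged = sum(1 for u in scores if u > threshold)
    return threshold, n_flagged, n_flagged * seconds_per_image


def pick_k(scores, max_hours, candidates=(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)):
    """Pick the smallest candidate k whose review time fits the time budget."""
    for k in candidates:
        threshold, n_flagged, seconds = correction_budget(scores, k)
        if seconds / 3600.0 <= max_hours:
            return k, threshold, n_flagged, seconds
    return None  # no candidate fits the budget


if __name__ == "__main__":
    toy_scores = [0.1, 0.2, 0.3, 0.4, 1.0]  # hypothetical uncertainty scores
    print(correction_budget(toy_scores, k=1.0))
    print(pick_k(toy_scores, max_hours=1.0))
```

A smaller `k` sends more images to human review, so scanning candidates from small to large returns the most thorough review that still fits the available time.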
A.2.3 Examples of Mislabeled images in A-FF++ and A-DFDC
In the evaluation results for Q1 in Section 3.2, we validated that existing annotations cannot be directly incorporated into our AI-Face dataset. Fig. A.3 displays example images where the annotations in A-FF++ [80] and A-DFDC [80] are inconsistent with those given by our annotator. A-FF++ and A-DFDC mislabel ambiguous facial images, whereas our annotator predicts them accurately. This visualization further validates that existing annotations cannot be directly merged into our dataset.
A.2.4 Additional Results of Validating the Effectiveness of Human Correction Strategy
Since A-DFD [19] and A-Celeb-DF-v2 [19] provide gender annotations, we can compare two versions of our dataset against them: ours before human correction (i.e., Ours-DFD(w/o Correction) and Ours-A-Celeb-DF-v2(w/o Correction)) and ours after human correction (i.e., Ours-DFD (Correction) and Ours-A-Celeb-DF-v2 (Correction)). Using the same setting as in the evaluation for Q2 in Section 3.2, we sample 1,200 attribute-balanced images (400 easy, 400 medium, and 400 hard) based on uncertainty score. Three humans re-annotated these images to establish ground truth. As shown in Table 14, Ours-DFD (Correction) and Ours-A-Celeb-DF-v2 (Correction) outperform both our uncorrected versions and the original A-DFD and A-Celeb-DF-v2 annotations (e.g., the accuracy of Ours-DFD (Correction) is 22.866 percentage points higher than A-DFD and 13.526 percentage points higher than Ours-DFD(w/o Correction)). This suggests that our annotation quality is much better than that of the existing annotations in A-DFD [19] and A-Celeb-DF-v2 [19], and that our human correction strategy further improves it.
Gender | ||||
ACC | Precision | Recall | F1 | |
A-DFD [19] | 70.612 | 71.347 | 74.245 | 69.900 |
Ours-DFD(w/o Correction) | 79.952 | 79.868 | 83.979 | 79.308 |
Ours-DFD (Correction) | 93.478 | 91.673 | 95.034 | 92.898 |
A-Celeb-DF-v2 [19] | 89.697 | 90.622 | 90.622 | 89.697 |
Ours-A-Celeb-DF-v2(w/o Correction) | 91.414 | 91.404 | 91.831 | 91.391 |
Ours-A-Celeb-DF-v2 (Correction) | 93.535 | 93.655 | 94.087 | 93.525 |
Appendix B Fairness Benchmark
B.1 Details of Detection Methods
Model Type | Detector | Backbone | GitHub Link | VENUE |
Naive | Xception [61] | Xception | https://github.com/ondyari/FaceForensics/blob/master | ICCV-2019 |
Efficient-B4 [62] | EfficientNet | https://github.com/lukemelas/EfficientNet-PyTorch | ICML-2019 | |
ViT-B/16 [63] | Transformer | https://github.com/lucidrains/vit-pytorch | ICLR-2021 | |
Spatial | UCF [16] | Xception | https://github.com/SCLBD/DeepfakeBench/tree/main | ICCV-2023 |
UnivFD [67] | CLIP ViT | https://github.com/Yuheng-Li/UniversalFakeDetect | CVPR-2023 | |
CORE [68] | Xception | https://github.com/niyunsheng/CORE | CVPRW-2022 | |
Frequency | F3Net [64] | Xception | https://github.com/yyk-wew/F3Net | ECCV-2020 |
SRM [66] | Xception | https://github.com/SCLBD/DeepfakeBench/tree/main | CVPR-2021 | |
SPSL [65] | Xception | https://github.com/SCLBD/DeepfakeBench/tree/main | CVPR-2021 | |
Fairness-enhanced | DAW-FDD [20] | Xception | Unpublished code, reproduced by us | WACV-2024 |
DAG-FDD [20] | Xception | Unpublished code, reproduced by us | WACV-2024 | |
PG-FDD [21] | Xception | https://github.com/Purdue-M2/Fairness-Generalization | CVPR-2024 |
Xception [61]: is a deep convolutional neural network (CNN) architecture that relies on depthwise separable convolutions. This approach significantly reduces the number of parameters and computational cost while maintaining high performance. Xception serves as a classic backbone in deepfake detectors.
Efficient-B4 [62]: is part of the EfficientNet family [62], which utilizes a novel model scaling method that uniformly scales all dimensions of depth, width, and resolution using a compound coefficient. EfficientNet also serves as a classic backbone in deepfake detectors.
ViT-B/16 [63]: is a model that applies the transformer architecture to images; the ‘B’ denotes the base model size, and ‘16’ indicates the patch size. ViT-B/16 splits images into patches of 16×16 pixels, linearly embeds each patch, adds positional embeddings, and feeds the resulting sequence of vectors into a standard transformer encoder.
F3Net [64]: utilizes a cross-attention two-stream network to effectively identify frequency-aware clues by integrating two branches: FAD and LFS. The FAD (Frequency-aware Decomposition) module divides the input image into various frequency bands using learnable partitions, representing the image with frequency-aware components to detect forgery patterns through this decomposition. Meanwhile, the LFS (Localized Frequency Statistics) module captures local frequency statistics to highlight statistical differences between authentic and counterfeit faces.
SPSL [65]: integrates spatial image data with the phase spectrum to detect up-sampling artifacts in face forgeries, enhancing the model’s generalization ability for face forgery detection. The paper provides a theoretical analysis of the effectiveness of using the phase spectrum. Additionally, it highlights that local texture information is more important than high-level semantic information for accurately detecting face forgeries.
SRM [66]: extracts high-frequency noise features and combines two different representations from the RGB and frequency domains to enhance the model’s generalization ability for face forgery detection.
UCF [16]: presents a multi-task disentanglement framework designed to tackle two key challenges in deepfake detection: overfitting to irrelevant features and overfitting to method-specific textures. By identifying and leveraging common features, this framework aims to improve the model’s generalization ability.
UnivFD [67]: uses the frozen CLIP ViT-L/14 [44] as a feature extractor and trains the last linear layer to classify fake and real images.
CORE [68]: explicitly enforces the consistency of different representations. It first captures various representations through different augmentations and then regularizes the cosine distance between these representations to enhance their consistency.
DAW-FDD [20]: a demographic-aware Fair Deepfake Detection (DAW-FDD) method that leverages demographic information and employs an existing fairness risk measure [82]. At a high level, DAW-FDD aims to ensure that the losses achieved by different user-specified groups of interest (e.g., different races or genders) are similar to each other (so that the AI face detector is not more accurate on one group than on another) and, moreover, that the losses across all groups are low. Specifically, DAW-FDD uses a CVaR [83, 84] loss function across groups (to address imbalance in demographic groups) and, per group, another CVaR loss function (to address imbalance in real vs. AI-generated training examples).
DAG-FDD [20]: a demographic-agnostic Fair Deepfake Detection (DAG-FDD) method based on distributionally robust optimization (DRO) [85, 86]. To use DAG-FDD, the user does not have to specify which attributes, such as race and gender, to treat as sensitive; the user only needs to specify a probability threshold for a minority group, without explicitly identifying all possible groups.
PG-FDD [21]: PG-FDD (Preserving Generalization Fair Deepfake Detection) employs disentanglement learning to extract demographic and domain-agnostic forgery features, promoting fair learning across a flattened loss landscape. Its framework combines disentanglement learning, fairness learning, and optimization modules. The disentanglement module introduces a loss to expose demographic and domain-agnostic features that enhance fairness generalization. The fairness learning module combines these features to promote fair learning, guided by generalization principles. The optimization module flattens the loss landscape, helping the model escape suboptimal solutions and strengthen fairness generalization.
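The nested CVaR objective described for DAW-FDD above can be illustrated with a small sketch. This is plain Python for exposition only, not the authors' training code: it evaluates the empirical CVaR through its standard variational form, and the function names and the two levels `alpha_inner` and `alpha_outer` are hypothetical.

```python
def cvar(losses, alpha):
    """Empirical CVaR at level alpha in (0, 1].

    Uses the variational form  min_lam  lam + mean(max(l - lam, 0)) / alpha.
    The objective is convex and piecewise linear with breakpoints at the
    sample losses, so scanning lam over the samples finds the exact minimum.
    """
    if not losses or not 0 < alpha <= 1:
        raise ValueError("need a non-empty loss list and 0 < alpha <= 1")
    n = len(losses)
    return min(
        lam + sum(max(l - lam, 0.0) for l in losses) / (alpha * n)
        for lam in losses
    )


def nested_group_cvar(losses_by_group, alpha_inner, alpha_outer):
    """Inner CVaR over the samples of each group, outer CVaR over the group risks."""
    group_risks = [cvar(ls, alpha_inner) for ls in losses_by_group.values()]
    return cvar(group_risks, alpha_outer)


if __name__ == "__main__":
    toy = {"group_a": [1.0, 2.0, 3.0, 4.0], "group_b": [0.0, 0.0, 0.0, 2.0]}
    print(nested_group_cvar(toy, alpha_inner=0.5, alpha_outer=0.5))
```

With `alpha = 1` the CVaR reduces to the ordinary mean, while smaller values focus the objective on the hardest samples (inner level) or the worst-off groups (outer level).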
B.2 Implementation Details
For the fairness benchmark, all experiments are implemented in PyTorch and run on a single NVIDIA RTX A6000 GPU. During training, we use the SGD optimizer with a learning rate of 0.0005, momentum of 0.9, and weight decay of 0.005. The batch size is set to 128 for most detectors. However, for SRM [66], UCF [16], and PG-FDD [21], the batch size is reduced to 32 due to GPU memory constraints. For hyperparameters defined in these detectors, we use the default values set in their original papers. All detectors are initialized with their official pre-trained weights and trained for 5 epochs.
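For reference, the settings above can be collected into a small configuration object. This is only an illustrative sketch: the class name `TrainConfig`, its field names, and the helper `batch_size_for` are hypothetical and do not come from the released benchmark code; the values are those stated in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings stated in Section B.2 (field names are hypothetical)."""
    optimizer: str = "SGD"
    learning_rate: float = 0.0005
    momentum: float = 0.9
    weight_decay: float = 0.005
    epochs: int = 5
    default_batch_size: int = 128
    reduced_batch_size: int = 32
    # Detectors trained with the smaller batch size due to GPU memory.
    reduced_batch_detectors: tuple = ("SRM", "UCF", "PG-FDD")

    def batch_size_for(self, detector: str) -> int:
        """Return the batch size used for a given detector name."""
        if detector in self.reduced_batch_detectors:
            return self.reduced_batch_size
        return self.default_batch_size


if __name__ == "__main__":
    cfg = TrainConfig()
    for name in ("Xception", "SRM", "PG-FDD"):
        print(name, cfg.batch_size_for(name))
```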
B.3 Fairness Metrics
We assume a test set of indexed samples, each with a true label and a predicted label. Both labels are binary, where 0 means real and 1 means fake. For all fairness metrics, a lower value means better performance.
The definitions involve the following quantities: the demographic variable and the set of subgroups it ranges over; the set of detection models and the set of fairness metrics; the rank of each detection model under each fairness metric; the total number of fairness metrics; the set of model types; and, for each model type, the set and total number of detection models it contains. Among the group fairness metrics, one measures the disparity in TPR or FPR between each subgroup and the overall population, one measures the maximum ACC gap across all demographic groups, one measures the maximum difference in prediction rates across all demographic groups, and one captures the largest disparity in prediction outcomes (either positive or negative) when comparing different demographic groups. The individual fairness metric uses a predefined scale factor (0.06 in our experiments) together with the predicted logits of the model for each input sample; it reflects the principle that a model is fair across individuals if similar individuals have similar predicted outcomes. Finally, the average fairness rank is computed both per detection model and per model type.
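Two of the group metrics described above, the maximum ACC gap and the maximum difference in prediction rates, can be sketched as follows. The code follows the verbal descriptions only and is not the benchmark's implementation: the function names are hypothetical, and results are returned as fractions rather than percentages.

```python
from collections import defaultdict


def _group_indices(groups):
    """Map each subgroup label to the list of sample indices in that subgroup."""
    index = defaultdict(list)
    for i, g in enumerate(groups):
        index[g].append(i)
    return index


def max_accuracy_gap(y_true, y_pred, groups):
    """Largest difference in accuracy between any two demographic subgroups."""
    accs = [
        sum(y_true[i] == y_pred[i] for i in idx) / len(idx)
        for idx in _group_indices(groups).values()
    ]
    return max(accs) - min(accs)


def max_prediction_rate_gap(y_pred, groups):
    """Largest difference in the rate of 'fake' (label 1) predictions between subgroups.

    For binary labels, the gap for label 0 equals the gap for label 1.
    """
    rates = [
        sum(y_pred[i] == 1 for i in idx) / len(idx)
        for idx in _group_indices(groups).values()
    ]
    return max(rates) - min(rates)


def average_rank(ranks_per_metric):
    """Average rank of one detector over all fairness metrics."""
    return sum(ranks_per_metric) / len(ranks_per_metric)


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
    groups = ["f", "f", "f", "f", "m", "m", "m", "m"]  # toy gender labels
    print(max_accuracy_gap(y_true, y_pred, groups))
    print(max_prediction_rate_gap(y_pred, groups))
```

Both functions return 0 when every subgroup is treated identically, which matches the convention that lower values indicate a fairer detector.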
B.4 Full Subsets Evaluation Results
This section presents the detailed test results for each subset, shown in Table 16 through Table 35. The findings align with the results reported in Fig. 4.
Model Type | ||||||||||||||||||||||||||||||||||||||
Naive | Frequency | Spatial | Fairness-enhanced | |||||||||||||||||||||||||||||||||||
Dataset | Attribute | Metric |
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||||||
FF++ | Gender | 4.353 | 3.346 | 1.161 | 2.595 | 0.887 | 2.492 | 4.916 | 10.873 | 2.516 | 12.198 | 1.606 | 2.214 | |||||||||||||||||||||||||
1.250 | 1.096 | 0.276 | 0.601 | 0.392 | 0.409 | 1.231 | 2.874 | 0.61 | 2.722 | 1.024 | 0.772 | |||||||||||||||||||||||||||
0.177 | 0.132 | 0.396 | 0.426 | 0.231 | 0.228 | 0.489 | 0.015 | 0.196 | 0.977 | 0.941 | 0.095 | |||||||||||||||||||||||||||
4.749 | 4.293 | 1.335 | 2.728 | 1.012 | 2.839 | 5.117 | 12.323 | 3.221 | 12.993 | 2.969 | 2.362 | |||||||||||||||||||||||||||
Race | 10.304 | 9.813 | 7.630 | 15.051 | 6.844 | 22.26 | 9.791 | 23.588 | 4.564 | 22.598 | 2.607 | 15.657 | ||||||||||||||||||||||||||
3.562 | 9.544 | 3.485 | 3.22 | 5.864 | 3.516 | 5.554 | 12.934 | 8.75 | 8.954 | 6.65 | 2.943 | |||||||||||||||||||||||||||
4.465 | 7.396 | 6.232 | 4.045 | 2.541 | 5.227 | 4.388 | 6.889 | 3.522 | 7.382 | 1.939 | 3.764 | |||||||||||||||||||||||||||
17.066 | 27.404 | 12.835 | 20.586 | 11.277 | 36.221 | 17.288 | 70.499 | 11.944 | 48.644 | 6.882 | 18.386 | |||||||||||||||||||||||||||
Age | 9.851 | 5.348 | 5.984 | 6.204 | 3.622 | 14.005 | 9.692 | 15.205 | 9.423 | 24.413 | 1.857 | 6.136 | ||||||||||||||||||||||||||
2.887 | 4.708 | 1.280 | 5.661 | 6.479 | 7.919 | 6.196 | 9.205 | 7.693 | 4.346 | 6.221 | 4.339 | |||||||||||||||||||||||||||
1.038 | 5.813 | 6.417 | 2.049 | 0.856 | 1.581 | 2.606 | 8.927 | 1.138 | 4.263 | 1.472 | 2.112 | |||||||||||||||||||||||||||
18.191 | 11.876 | 8.199 | 13.665 | 5.636 | 20.696 | 17.291 | 29.607 | 14.781 | 47.613 | 6.419 | 12.446 | |||||||||||||||||||||||||||
Intersection | 28.949 | 16.662 | 11.994 | 18.672 | 8.505 | 30.828 | 19.132 | 54.201 | 8.784 | 39.858 | 5.130 | 16.994 | ||||||||||||||||||||||||||
11.648 | 12.215 | 5.127 | 6.721 | 10.157 | 4.449 | 11.268 | 32.584 | 14.697 | 20.864 | 10.087 | 4.831 | |||||||||||||||||||||||||||
8.442 | 10.876 | 10.295 | 7.210 | 4.868 | 8.742 | 5.638 | 15.415 | 8.209 | 10.843 | 3.322 | 4.491 | |||||||||||||||||||||||||||
70.162 | 68.005 | 32.625 | 48.53 | 25.922 | 78.296 | 40.971 | 169.535 | 33.428 | 131.755 | 19.399 | 38.887 | |||||||||||||||||||||||||||
- | ACC | 92.280 | 89.282 | 86.051 | 94.832 | 93.676 | 92.587 | 94.982 | 83.652 | 95.183 | 91.420 | 93.254 | 96.237 | |||||||||||||||||||||||||
AUC | 95.605 | 91.281 | 83.542 | 97.878 | 97.820 | 96.164 | 98.115 | 76.839 | 98.147 | 94.618 | 97.996 | 98.245 | ||||||||||||||||||||||||||
AP | 99.207 | 98.381 | 96.712 | 99.631 | 99.619 | 99.29 | 99.668 | 95.321 | 99.684 | 99.011 | 99.658 | 99.681 | ||||||||||||||||||||||||||
EER | 10.951 | 16.807 | 24.299 | 6.756 | 6.565 | 9.888 | 6.02 | 30.755 | 5.993 | 12.449 | 6.429 | 7.273 |
Model Type | ||||||||||||||||||||||||||||||||||||||
Naive | Frequency | Spatial | Fairness-enhanced | |||||||||||||||||||||||||||||||||||
Dataset | Attribute | Metric |
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||||||
DFDC | Gender | 6.011 | 3.708 | 5.567 | 4.415 | 2.357 | 8.711 | 1.492 | 7.444 | 2.062 | 4.87 | 2.944 | 3.687 | |||||||||||||||||||||||||
5.878 | 0.998 | 3.43 | 4.959 | 3.829 | 8.35 | 3.776 | 5.039 | 3.662 | 5.271 | 4.468 | 6.348 | |||||||||||||||||||||||||||
2.97 | 2.744 | 1.88 | 2.645 | 2.537 | 2.438 | 1.427 | 0.075 | 2.178 | 3.012 | 2.233 | 0.869 | |||||||||||||||||||||||||||
6.222 | 5.564 | 7.742 | 5.609 | 3.833 | 10.841 | 2.517 | 12.51 | 3.784 | 4.898 | 3.95 | 4.968 | |||||||||||||||||||||||||||
Race | 8.525 | 6.846 | 24.319 | 7.667 | 9.726 | 9.139 | 11.603 | 22.342 | 10.74 | 15.529 | 11.403 | 5.992 | ||||||||||||||||||||||||||
21.619 | 11.534 | 20.596 | 25.03 | 24.594 | 21.463 | 25.9 | 24.317 | 26.634 | 23.997 | 25.534 | 24.613 | |||||||||||||||||||||||||||
1.622 | 3.701 | 12.756 | 3.048 | 2.816 | 5.793 | 4.722 | 11.051 | 2.699 | 14.659 | 3.46 | 2.523 | |||||||||||||||||||||||||||
26.728 | 15.611 | 47.784 | 25.2 | 25.744 | 24.09 | 22.014 | 65.788 | 23.784 | 47.679 | 26.64 | 12.268 | |||||||||||||||||||||||||||
Age | 6.193 | 7.721 | 17.868 | 5.022 | 7.375 | 10.382 | 4.608 | 13.078 | 5.683 | 20.119 | 5.578 | 3.96 | ||||||||||||||||||||||||||
11.068 | 4.752 | 14.277 | 9.967 | 12.117 | 8.172 | 11.987 | 9.764 | 9.112 | 13.229 | 10.702 | 11.48 | |||||||||||||||||||||||||||
2.817 | 5.951 | 4.984 | 3.918 | 2.585 | 6.092 | 2.513 | 7.523 | 3.869 | 12.581 | 2.498 | 1.653 | |||||||||||||||||||||||||||
14.397 | 16.233 | 26.396 | 14.327 | 13.88 | 16.625 | 12.03 | 22.816 | 11.274 | 31.018 | 8.736 | 6.954 | |||||||||||||||||||||||||||
Intersection | 14.479 | 15.029 | 33.979 | 14.067 | 24.924 | 14.117 | 16.119 | 38.533 | 17.421 | 20.447 | 18.268 | 10.973 | ||||||||||||||||||||||||||
28.877 | 17.816 | 30.153 | 32.117 | 31.493 | 27.666 | 31.604 | 28.815 | 33.791 | 27.812 | 30.224 | 31.389 | |||||||||||||||||||||||||||
5.619 | 8.088 | 20.771 | 4.456 | 7.453 | 9.306 | 5.922 | 14.994 | 4.423 | 18.877 | 5.642 | 3.832 | |||||||||||||||||||||||||||
72.695 | 60.893 | 111.03 | 59.238 | 67.19 | 60.749 | 63.262 | 133.283 | 58.174 | 90.761 | 64.155 | 33.495 | |||||||||||||||||||||||||||
- | ACC | 81.223 | 71.939 | 71.044 | 87.658 | 87.482 | 83.536 | 89.155 | 64.164 | 88.75 | 81.452 | 88.867 | 92.905 | |||||||||||||||||||||||||
AUC | 90.395 | 80.17 | 81.942 | 95.158 | 95.789 | 91.837 | 96.025 | 72.228 | 95.65 | 91.695 | 95.916 | 97.014 | ||||||||||||||||||||||||||
AP | 91.284 | 81.442 | 82.547 | 95.764 | 96.313 | 92.435 | 96.567 | 75.304 | 96.219 | 92.37 | 96.447 | 97.081 | ||||||||||||||||||||||||||
EER | 18.443 | 28.133 | 26.271 | 12.367 | 10.805 | 15.588 | 10.818 | 33.542 | 10.927 | 17.043 | 10.709 | 8.317 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
DFD | Gender | 7.052 | 2.139 | 3.857 | 6.07 | 0.269 | 10.037 | 6.095 | 1.039 | 2.257 | 3.605 | 3.059 | 0.717 | |||||||||||||||||||||||||
5.543 | 5.864 | 1.231 | 4.871 | 7.154 | 2.188 | 5.342 | 2.327 | 7.261 | 6.893 | 8.593 | 6.827 | |||||||||||||||||||||||||||
3.445 | 3.199 | 5.624 | 2.212 | 0.232 | 4.177 | 2.313 | 5.785 | 2.198 | 2.996 | 2.609 | 1.241 | |||||||||||||||||||||||||||
8.657 | 3.812 | 3.868 | 6.325 | 0.467 | 10.381 | 6.731 | 1.300 | 4.139 | 5.855 | 3.900 | 1.326 | |||||||||||||||||||||||||||
Race | 5.975 | 6.844 | 12.306 | 5.574 | 0.319 | 18.91 | 6.141 | 20.641 | 6.292 | 10.597 | 6.021 | 11.641 | ||||||||||||||||||||||||||
25.863 | 19.116 | 11.976 | 25.678 | 40.64 | 21.081 | 28.174 | 14.104 | 26.949 | 28.784 | 28.439 | 29.743 | |||||||||||||||||||||||||||
6.002 | 10.714 | 16.754 | 4.602 | 0.206 | 8.565 | 3.89 | 15.842 | 4.917 | 4.594 | 4.125 | 3.797 | |||||||||||||||||||||||||||
16.002 | 17.914 | 24.819 | 14.628 | 0.884 | 32.788 | 15.477 | 51.098 | 17.872 | 19.959 | 13.855 | 17.467 | |||||||||||||||||||||||||||
Age | 14.485 | 13.629 | 2.744 | 9.24 | 0.9 | 9.38 | 10.383 | 6.69 | 10.768 | 10.107 | 10.942 | 5.6 | ||||||||||||||||||||||||||
34.386 | 18.578 | 10.063 | 32.355 | 20.119 | 18.826 | 33.553 | 4.892 | 34.865 | 32.253 | 34.503 | 27.165 | |||||||||||||||||||||||||||
11.001 | 18.255 | 22.847 | 6.943 | 0.434 | 13.797 | 7.41 | 23.7 | 5.315 | 7.635 | 6.256 | 6.095 | |||||||||||||||||||||||||||
22.487 | 33.616 | 6.473 | 15.786 | 1.97 | 13.896 | 18.326 | 12.859 | 16.44 | 14.035 | 13.272 | 8.349 | |||||||||||||||||||||||||||
Intersection | 15.691 | 37.9 | 20.833 | 13.62 | 1.786 | 27.246 | 11.053 | 35.828 | 18.056 | 20.833 | 9.157 | 12.903 | ||||||||||||||||||||||||||
35.824 | 31.581 | 18.295 | 36.56 | 53.771 | 29.097 | 38.828 | 28.054 | 38.536 | 41.172 | 39.027 | 41.388 | |||||||||||||||||||||||||||
9.913 | 15.939 | 21.972 | 6.863 | 1.322 | 11.216 | 6.327 | 22.706 | 7.158 | 6.101 | 6.097 | 5.31 | |||||||||||||||||||||||||||
46.408 | 79.825 | 68.93 | 42.743 | 7.073 | 91.155 | 41.325 | 111.273 | 53.592 | 49.779 | 40.678 | 41.822 | |||||||||||||||||||||||||||
- | ACC | 93.039 | 88.321 | 83.862 | 94.6 | 99.505 | 91.405 | 94.984 | 80.753 | 94.761 | 92.99 | 94.6 | 97.102 | |||||||||||||||||||||||||
AUC | 97.507 | 93.914 | 89.886 | 98.478 | 99.942 | 96.347 | 98.592 | 82.817 | 98.651 | 97.659 | 98.813 | 99.082 | ||||||||||||||||||||||||||
AP | 99.349 | 98.366 | 97.059 | 99.596 | 99.965 | 98.929 | 99.614 | 95.008 | 99.62 | 99.375 | 99.687 | 99.75 | ||||||||||||||||||||||||||
EER | 8.086 | 13.377 | 18.014 | 6.183 | 0.500 | 10.048 | 6.124 | 24.911 | 5.945 | 7.788 | 5.529 | 5.470 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
Celeb-DF-v2 | Gender | 1.764 | 8.434 | 10.889 | 0.584 | 2.701 | 2.377 | 2.645 | 13.511 | 2.78 | 1.312 | 0.997 | 1.584 | |||||||||||||||||||||||||
6.072 | 7.227 | 0.405 | 6.219 | 8.541 | 7.023 | 7 | 9.706 | 6.693 | 6.071 | 7.104 | 6.023 | |||||||||||||||||||||||||||
1.578 | 2.063 | 5.663 | 1.092 | 2.149 | 0.976 | 0.884 | 8.053 | 0.636 | 2.411 | 0.831 | 0.429 | |||||||||||||||||||||||||||
2.585 | 10.238 | 11.073 | 1.108 | 3.236 | 3.379 | 3.564 | 20.484 | 3.369 | 1.599 | 1.519 | 1.601 | |||||||||||||||||||||||||||
Race | 5.583 | 9.879 | 14.539 | 7.288 | 8.943 | 9.753 | 4.16 | 32.306 | 4.502 | 21.999 | 8.275 | 7.45 | ||||||||||||||||||||||||||
19.474 | 19.627 | 14.411 | 21.812 | 24.643 | 16.882 | 20.953 | 12.222 | 21.337 | 22.694 | 24.787 | 16.744 | |||||||||||||||||||||||||||
6.569 | 9.664 | 12.618 | 4.032 | 6.035 | 5.493 | 3.524 | 10.82 | 3.813 | 3.092 | 5.392 | 2.815 | |||||||||||||||||||||||||||
10.691 | 25.652 | 28.759 | 13.013 | 11.726 | 15.42 | 14.684 | 63.524 | 9.671 | 58.225 | 14.714 | 12.384 | |||||||||||||||||||||||||||
Age | 7.172 | 7.331 | 15.248 | 6.974 | 1.948 | 8.784 | 3.873 | 29.904 | 3.539 | 5.903 | 2.508 | 5.968 | ||||||||||||||||||||||||||
33.004 | 25.16 | 6.737 | 33.891 | 33.072 | 33.648 | 32.236 | 18.794 | 32.986 | 24.932 | 34.577 | 32.264 | |||||||||||||||||||||||||||
1.925 | 8.576 | 26.359 | 1.628 | 1.532 | 1.149 | 2.502 | 12.526 | 3.482 | 10.577 | 0.845 | 1.183 | |||||||||||||||||||||||||||
11.497 | 14.073 | 19.966 | 11.657 | 5.013 | 11.178 | 7.669 | 53.72 | 9.685 | 10.027 | 5.404 | 7.037 | |||||||||||||||||||||||||||
Intersection | 20 | 32.79 | 57.779 | 14.286 | 28.571 | 16.19 | 14.286 | 58.368 | 16.774 | 25.477 | 14.286 | 12.381 | ||||||||||||||||||||||||||
76.368 | 78.595 | 67.795 | 76.672 | 77.839 | 77.371 | 76.863 | 67.188 | 76.881 | 77.935 | 75.761 | 77.349 | |||||||||||||||||||||||||||
19.231 | 16.228 | 49.562 | 11.538 | 8.463 | 7.334 | 7.692 | 29.689 | 11.538 | 5.769 | 7.692 | 5.769 | |||||||||||||||||||||||||||
71.129 | 114.538 | 103.126 | 53.655 | 59.887 | 59.694 | 60.765 | 182.729 | 61.653 | 141.381 | 48.621 | 33.495 | |||||||||||||||||||||||||||
- | ACC | 97.43 | 95.129 | 91.548 | 98.145 | 97.511 | 98.073 | 98.263 | 88.191 | 98.221 | 96.073 | 98.405 | 98.754 | |||||||||||||||||||||||||
AUC | 99.345 | 97.548 | 96.504 | 99.652 | 99.579 | 99.448 | 99.684 | 83.086 | 99.685 | 98.377 | 99.702 | 99.815 | ||||||||||||||||||||||||||
AP | 99.908 | 99.641 | 99.492 | 99.953 | 99.943 | 99.923 | 99.957 | 97.068 | 99.957 | 99.763 | 99.96 | 99.974 | ||||||||||||||||||||||||||
EER | 3.733 | 8.041 | 9.747 | 2.189 | 2.074 | 2.857 | 2.051 | 25.184 | 2.143 | 6.382 | 1.636 | 2.281 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
AttGAN | Gender | 0.56 | 1.669 | 0.472 | 0.946 | 0.459 | 1.554 | 0.422 | 4.79 | 3.544 | 1.941 | 0.153 | 1.47 | |||||||||||||||||||||||||
10.923 | 12.249 | 10.678 | 11.069 | 11.068 | 10.489 | 11.295 | 9.998 | 12.045 | 11.936 | 11.171 | 12.205 | |||||||||||||||||||||||||||
0.619 | 0.287 | 0.739 | 1.053 | 0.432 | 1.136 | 0.085 | 3.38 | 0.75 | 0.721 | 0.165 | 0.288 | |||||||||||||||||||||||||||
1.096 | 3.05 | 0.816 | 1.695 | 0.617 | 2.078 | 0.676 | 6.049 | 3.62 | 2.379 | 0.177 | 2.275 | |||||||||||||||||||||||||||
Race | 3.39 | 3.613 | 3.228 | 3.198 | 3.918 | 4.013 | 3.03 | 18.643 | 17.655 | 7.576 | 1.887 | 1.695 | ||||||||||||||||||||||||||
11.859 | 13.88 | 11.05 | 12.876 | 10.628 | 11.582 | 13.059 | 22.615 | 16.834 | 13.753 | 13.054 | 13.502 | |||||||||||||||||||||||||||
1.587 | 2.174 | 2.526 | 2.387 | 2.033 | 2.31 | 2.521 | 4.713 | 5.636 | 5.042 | 1.6 | 1.6 | |||||||||||||||||||||||||||
5.016 | 10.539 | 9.994 | 5.975 | 9.472 | 9.239 | 5.592 | 37.89 | 19.269 | 12.917 | 4.291 | 5.117 | |||||||||||||||||||||||||||
Age | 3.086 | 4.899 | 7.144 | 1.194 | 3.704 | 3.096 | 2.469 | 15.211 | 5.996 | 2.855 | 2.206 | 5.493 | ||||||||||||||||||||||||||
22.439 | 21.386 | 18.175 | 23.439 | 22.14 | 22.789 | 24.491 | 20.473 | 21.175 | 22.14 | 24.789 | 21.193 | |||||||||||||||||||||||||||
1.105 | 3.595 | 4.689 | 0.942 | 2.563 | 3.132 | 2.456 | 6.436 | 4.493 | 2.309 | 0.398 | 3.758 | |||||||||||||||||||||||||||
5.209 | 10.255 | 13.136 | 4.103 | 10.371 | 10.312 | 5.807 | 36.092 | 8.639 | 6.932 | 3.746 | 7.407 | |||||||||||||||||||||||||||
Intersection | 5.128 | 11.111 | 11.111 | 7.692 | 7.407 | 6.667 | 7.407 | 31.774 | 33.333 | 7.692 | 3.125 | 5 | ||||||||||||||||||||||||||
20.594 | 24.253 | 20.152 | 21.106 | 20.783 | 19.375 | 22.003 | 28.514 | 28.753 | 21.677 | 22.003 | 23.411 | |||||||||||||||||||||||||||
4.225 | 7.042 | 4.968 | 5.634 | 3.177 | 4.878 | 4.348 | 16.17 | 10.976 | 5.479 | 2.817 | 1.852 | |||||||||||||||||||||||||||
21.471 | 42.762 | 35.107 | 23.6 | 25.943 | 31.389 | 22.546 | 92.264 | 46.215 | 33.053 | 12.897 | 17.18 | |||||||||||||||||||||||||||
- | ACC | 98.482 | 97.884 | 95.86 | 98.62 | 98.482 | 98.666 | 98.712 | 80.957 | 96.274 | 97.608 | 99.264 | 99.126 | |||||||||||||||||||||||||
AUC | 99.798 | 99.526 | 99.259 | 99.776 | 99.702 | 99.642 | 99.875 | 89.719 | 98.721 | 99.722 | 99.781 | 99.953 | ||||||||||||||||||||||||||
AP | 99.795 | 99.492 | 99.282 | 99.797 | 99.612 | 99.587 | 99.888 | 91.76 | 98.646 | 99.732 | 99.827 | 99.958 | ||||||||||||||||||||||||||
EER | 1.594 | 2.092 | 4.084 | 1.494 | 1.494 | 1.394 | 1.195 | 18.426 | 4.98 | 2.39 | 0.996 | 1.494 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
MMDGAN | Gender | 2.722 | 17.144 | 0.773 | 1.087 | 3.809 | 3.261 | 4.622 | 1.626 | 3.394 | 8.439 | 11.417 | 2.448 | |||||||||||||||||||||||||
8.077 | 16.007 | 6.925 | 7.801 | 8.077 | 7.939 | 9.604 | 6.787 | 9.584 | 11.082 | 13.928 | 9.141 | |||||||||||||||||||||||||||
2.335 | 7.348 | 1.084 | 1.271 | 3.398 | 3.26 | 2.797 | 0.63 | 0.512 | 4.275 | 4.994 | 1.133 | |||||||||||||||||||||||||||
4.772 | 18.286 | 0.773 | 2.028 | 6.9 | 6.352 | 5.63 | 3.071 | 4.481 | 9.447 | 12.493 | 2.448 | |||||||||||||||||||||||||||
Race | 10 | 73.95 | 8.974 | 16.667 | 14.286 | 8.333 | 8.333 | 33.333 | 16.667 | 28.571 | 28.571 | 2.21 | ||||||||||||||||||||||||||
33 | 55 | 32 | 33 | 38 | 33 | 33 | 33 | 33 | 43 | 43 | 33 | |||||||||||||||||||||||||||
9.091 | 22 | 5.233 | 9.091 | 5 | 4.545 | 4.545 | 18.182 | 9.091 | 10 | 10 | 1.187 | |||||||||||||||||||||||||||
23.462 | 91.198 | 13.141 | 22.352 | 28.661 | 15.508 | 19.122 | 48.478 | 25.201 | 39.585 | 44.593 | 5.931 | |||||||||||||||||||||||||||
Age | 11.706 | 22.297 | 4.808 | 10.345 | 10.345 | 10.345 | 7.642 | 6.924 | 9.091 | 14.336 | 14.559 | 9.091 | ||||||||||||||||||||||||||
17.703 | 20.303 | 10.909 | 9.394 | 11.515 | 11.212 | 11.818 | 10.303 | 12.727 | 8.788 | 12.727 | 11.818 | |||||||||||||||||||||||||||
3.939 | 10.606 | 2.424 | 5.455 | 9.091 | 6.515 | 4.127 | 3.828 | 2.233 | 6.89 | 8.254 | 5.263 | |||||||||||||||||||||||||||
23.368 | 39.483 | 6.422 | 13.465 | 22.203 | 25.61 | 17.407 | 13.78 | 17.996 | 21.652 | 24.926 | 10.142 | |||||||||||||||||||||||||||
Intersection | 12.5 | 100 | 22.222 | 22.222 | 16.667 | 11.111 | 11.111 | 44.444 | 22.222 | 100 | 33.333 | 4.167 | ||||||||||||||||||||||||||
58.088 | 62.5 | 51.471 | 58.088 | 58.088 | 58.088 | 58.088 | 58.088 | 58.088 | 70.588 | 58.088 | 58.088 | |||||||||||||||||||||||||||
11.765 | 41.667 | 12.5 | 11.765 | 8.333 | 5.882 | 5.882 | 23.529 | 11.765 | 12.5 | 16.667 | 1.948 | |||||||||||||||||||||||||||
51.536 | 230.67 | 59.507 | 44.743 | 59.982 | 39.11 | 42.671 | 103.142 | 55.926 | 161.146 | 85.72 | 14.412 | |||||||||||||||||||||||||||
- | ACC | 97.525 | 90.099 | 95.792 | 98.02 | 97.03 | 98.02 | 97.772 | 93.812 | 96.535 | 94.307 | 96.287 | 99.01 | |||||||||||||||||||||||||
AUC | 99.395 | 97.839 | 99.299 | 99.687 | 99.508 | 99.392 | 99.781 | 97.987 | 98.521 | 99.515 | 99.808 | 99.98 | ||||||||||||||||||||||||||
AP | 98.918 | 97.589 | 99.226 | 99.691 | 99.461 | 99.215 | 99.792 | 97.182 | 98.25 | 99.525 | 99.812 | 99.983 | ||||||||||||||||||||||||||
EER | 2.646 | 7.407 | 3.704 | 1.587 | 2.646 | 2.646 | 2.116 | 6.349 | 4.233 | 1.587 | 1.587 | 0.529 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
StarGAN | Gender | 0.281 | 2.181 | 0.472 | 0.33 | 0.388 | 0.552 | 0.044 | 2.075 | 1.737 | 0.344 | 0.627 | 0.33 | |||||||||||||||||||||||||
4.379 | 5.639 | 3.887 | 4.547 | 4.555 | 4.744 | 4.461 | 5.32 | 5.209 | 4.678 | 4.954 | 4.65 | |||||||||||||||||||||||||||
0.181 | 0.604 | 0.435 | 0.3 | 0.275 | 0.332 | 0.008 | 0.095 | 0.214 | 0.07 | 0.107 | 0.197 | |||||||||||||||||||||||||||
0.408 | 2.679 | 0.59 | 0.486 | 0.561 | 0.619 | 0.049 | 2.74 | 2.18 | 0.606 | 0.957 | 0.375 | |||||||||||||||||||||||||||
Race | 4 | 4.577 | 11.031 | 4 | 8 | 4 | 8 | 22.727 | 11.197 | 6.113 | 4 | 2.062 | ||||||||||||||||||||||||||
27.875 | 27.493 | 29.459 | 28.39 | 29.39 | 26.875 | 29.086 | 35.768 | 30.974 | 27.403 | 27.875 | 26.056 | |||||||||||||||||||||||||||
1.515 | 3.036 | 4.571 | 1.515 | 3.03 | 3.03 | 3.03 | 6.682 | 2.931 | 1.395 | 1.515 | 1.325 | |||||||||||||||||||||||||||
7.666 | 11.827 | 17.243 | 8.507 | 10.863 | 8.762 | 10.006 | 29.26 | 17.857 | 12.09 | 6.291 | 4.109 | |||||||||||||||||||||||||||
Age | 2.479 | 3.577 | 5.091 | 3.167 | 1.667 | 2.5 | 1.379 | 1.167 | 4.562 | 2.033 | 1.667 | 0.943 | ||||||||||||||||||||||||||
19.078 | 17.476 | 16.434 | 19.802 | 19.399 | 19.078 | 19.319 | 17.244 | 18.927 | 17.39 | 19.078 | 19.158 | |||||||||||||||||||||||||||
1.132 | 2.264 | 2.119 | 0.323 | 1.201 | 2.264 | 1.132 | 2.075 | 0.843 | 0.908 | 1.509 | 0.601 | |||||||||||||||||||||||||||
4.659 | 6.124 | 9.058 | 5.728 | 4.038 | 5.801 | 2.946 | 2.421 | 6.082 | 4.539 | 3.86 | 2.745 | |||||||||||||||||||||||||||
Intersection | 14.286 | 12.5 | 21.774 | 6.25 | 11.111 | 6.25 | 14.286 | 25 | 18.75 | 11.111 | 5.556 | 2.381 | ||||||||||||||||||||||||||
30.971 | 32.154 | 36.599 | 31.973 | 33.612 | 29.932 | 31.571 | 38.639 | 36.417 | 31.791 | 31.571 | 28.326 | |||||||||||||||||||||||||||
5.882 | 5.882 | 7.418 | 2.222 | 4.082 | 4.082 | 5.882 | 8.889 | 5.462 | 4.082 | 2.041 | 1.471 | |||||||||||||||||||||||||||
22.432 | 38.089 | 41.688 | 20.567 | 19.704 | 19.605 | 25.484 | 59.844 | 36.081 | 26.652 | 11.756 | 8.426 | |||||||||||||||||||||||||||
- | ACC | 99.326 | 98.289 | 96.216 | 99.015 | 99.274 | 99.378 | 99.43 | 94.66 | 96.319 | 98.237 | 99.482 | 99.533 | |||||||||||||||||||||||||
AUC | 99.874 | 99.773 | 99.556 | 99.909 | 99.86 | 99.869 | 99.964 | 99.626 | 99.076 | 99.796 | 99.909 | 99.983 | ||||||||||||||||||||||||||
AP | 99.899 | 99.797 | 99.56 | 99.933 | 99.826 | 99.832 | 99.97 | 99.724 | 99.079 | 99.809 | 99.929 | 99.986 | ||||||||||||||||||||||||||
EER | 0.795 | 1.135 | 2.611 | 0.454 | 0.795 | 0.681 | 0.568 | 1.93 | 3.973 | 1.589 | 0.568 | 0.454 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
StyleGAN | Gender | 0.436 | 2.543 | 1.046 | 1.136 | 1.558 | 0.561 | 0.447 | 3.248 | 3.136 | 0.44 | 0.789 | 0.136 | |||||||||||||||||||||||||
20.675 | 21.927 | 20.009 | 20.923 | 21.864 | 20.013 | 20.847 | 20.617 | 21.168 | 20.521 | 21.311 | 21.15 | |||||||||||||||||||||||||||
0.17 | 0.208 | 0.869 | 0.551 | 0.344 | 0.496 | 0.205 | 1.003 | 0.187 | 0.029 | 0.404 | 0.027 | |||||||||||||||||||||||||||
0.533 | 3.444 | 1.191 | 1.686 | 2.035 | 0.926 | 0.454 | 3.747 | 3.316 | 0.73 | 1.009 | 0.205 | |||||||||||||||||||||||||||
Race | 4.078 | 13.916 | 11.459 | 4.498 | 3.659 | 11.815 | 2.439 | 20.18 | 18.22 | 2.941 | 1.22 | 2.105 | ||||||||||||||||||||||||||
25.95 | 25.845 | 24.097 | 24.671 | 24.566 | 27.593 | 24.207 | 27.273 | 25.481 | 23.849 | 24.053 | 24.257 | |||||||||||||||||||||||||||
1.149 | 5.309 | 4.74 | 1.905 | 1.693 | 4.72 | 0.607 | 8.377 | 7.544 | 0.892 | 0.635 | 1.075 | |||||||||||||||||||||||||||
7.696 | 21.343 | 19.263 | 8.82 | 6.977 | 15.593 | 4.642 | 30.007 | 26.199 | 5.893 | 3.345 | 3.163 | |||||||||||||||||||||||||||
Age | 9.065 | 19.373 | 11.673 | 18.17 | 19.059 | 17.073 | 1.491 | 9.843 | 7.494 | 9.065 | 9.53 | 1.556 | ||||||||||||||||||||||||||
49.291 | 52.488 | 44.166 | 48.832 | 50.785 | 49.475 | 47.41 | 40.723 | 44.455 | 49.356 | 49.553 | 48.55 | |||||||||||||||||||||||||||
0.783 | 2.425 | 14.36 | 2.943 | 2.104 | 1.163 | 1.068 | 10.291 | 8.482 | 0.908 | 1.64 | 1.333 | |||||||||||||||||||||||||||
15.085 | 35.787 | 23.787 | 24.836 | 29.459 | 28.836 | 3.87 | 22.027 | 11.175 | 15.268 | 19.652 | 2.362 | |||||||||||||||||||||||||||
Intersection | 7.407 | 17.306 | 16.78 | 7.143 | 7.143 | 17.857 | 3.704 | 24.774 | 26.04 | 4.054 | 3.571 | 2.817 | ||||||||||||||||||||||||||
47.301 | 51.383 | 47.73 | 47.13 | 50.191 | 48.999 | 47.301 | 50.62 | 50.534 | 47.215 | 48.236 | 48.322 | |||||||||||||||||||||||||||
4.545 | 6.281 | 6.404 | 4.082 | 2.273 | 6.818 | 2.273 | 8.961 | 9.494 | 3.061 | 2.041 | 1.105 | |||||||||||||||||||||||||||
20.331 | 50.793 | 39.145 | 22.806 | 19.225 | 41.872 | 11.758 | 53.806 | 50.322 | 16.809 | 12.616 | 7.417 | |||||||||||||||||||||||||||
- | ACC | 98.975 | 97.819 | 96.347 | 98.476 | 99.08 | 97.976 | 99.527 | 94.77 | 96.399 | 99.054 | 99.448 | 99.685 | |||||||||||||||||||||||||
AUC | 99.925 | 99.794 | 99.392 | 99.861 | 99.964 | 99.902 | 99.985 | 99.703 | 99.51 | 99.892 | 99.986 | 99.979 | ||||||||||||||||||||||||||
AP | 99.94 | 99.854 | 99.386 | 99.904 | 99.97 | 99.916 | 99.988 | 99.756 | 99.316 | 99.925 | 99.989 | 99.981 | ||||||||||||||||||||||||||
EER | 0.982 | 1.443 | 3.753 | 1.501 | 0.693 | 1.27 | 0.52 | 2.887 | 2.483 | 0.982 | 0.52 | 0.462 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
StyleGAN2 | Gender | 0.847 | 0.56 | 1.556 | 0.976 | 0.487 | 0.666 | 0.27 | 0.447 | 1.073 | 0.534 | 0.241 | 0.045 | |||||||||||||||||||||||||
3.512 | 2.482 | 3.231 | 3.434 | 2.926 | 2.668 | 2.988 | 2.775 | 2.129 | 2.984 | 2.961 | 2.871 | |||||||||||||||||||||||||||
0.077 | 0.341 | 0.624 | 0.246 | 0.342 | 0.636 | 0.092 | 0.117 | 0.698 | 0.31 | 0.092 | 0.022 | |||||||||||||||||||||||||||
1.538 | 0.594 | 1.7 | 1.338 | 0.686 | 1.192 | 0.317 | 0.482 | 1.113 | 0.56 | 0.263 | 0.047 | |||||||||||||||||||||||||||
Race | 1.037 | 5.147 | 6.385 | 1.057 | 1.401 | 6.072 | 1.244 | 15.519 | 16.197 | 2.565 | 0.926 | 0.517 | ||||||||||||||||||||||||||
33.803 | 35.296 | 33.103 | 34.638 | 35.076 | 36.619 | 35.925 | 38.228 | 38.381 | 34.674 | 35.711 | 35.522 | |||||||||||||||||||||||||||
1.451 | 1.354 | 1.4 | 0.471 | 0.506 | 1.907 | 0.229 | 2.583 | 2.74 | 1.331 | 0.38 | 0.247 | |||||||||||||||||||||||||||
2.251 | 7.675 | 13.968 | 2.428 | 2.762 | 7.489 | 2.377 | 17.827 | 19.998 | 5.173 | 2.369 | 1.48 | |||||||||||||||||||||||||||
Age | 2.766 | 2.561 | 8.543 | 3.408 | 3.328 | 2.493 | 2.486 | 9.408 | 9.634 | 6.514 | 2.532 | 0.607 | ||||||||||||||||||||||||||
16.251 | 16.177 | 16.677 | 16.418 | 16.74 | 15.95 | 16.669 | 16.91 | 18.079 | 12.621 | 16.323 | 15.762 | |||||||||||||||||||||||||||
1.016 | 0.375 | 2.35 | 1.05 | 1.265 | 2.008 | 0.647 | 1.74 | 2.042 | 5.433 | 0.97 | 0.528 | |||||||||||||||||||||||||||
5.353 | 2.926 | 13.051 | 5.249 | 4.883 | 6.547 | 3.779 | 12.811 | 10.052 | 11.092 | 3.689 | 1.966 | |||||||||||||||||||||||||||
Intersection | 2.436 | 5.448 | 9.55 | 2.127 | 2.384 | 7.475 | 1.468 | 18.286 | 20.753 | 3.132 | 1.511 | 0.695 | ||||||||||||||||||||||||||
37.77 | 39.222 | 35.56 | 38.81 | 38.411 | 39.991 | 39.446 | 42.732 | 42.186 | 38.965 | 39.128 | 38.732 | |||||||||||||||||||||||||||
1.896 | 2.822 | 2.726 | 0.658 | 1.073 | 2.369 | 0.643 | 3.644 | 4.862 | 1.795 | 0.575 | 0.488 | |||||||||||||||||||||||||||
9.7 | 18.646 | 26.948 | 7.826 | 7.696 | 17.235 | 5.947 | 35.527 | 41.168 | 13.195 | 5.737 | 3.074 | |||||||||||||||||||||||||||
- | ACC | 97.46 | 98.044 | 95.299 | 98.472 | 98.799 | 98.331 | 99.311 | 94.745 | 96.23 | 97.207 | 99.32 | 99.479 | |||||||||||||||||||||||||
AUC | 99.738 | 99.698 | 98.85 | 99.794 | 99.816 | 99.741 | 99.877 | 99.205 | 99.209 | 99.713 | 99.883 | 99.968 | ||||||||||||||||||||||||||
AP | 99.787 | 99.656 | 98.871 | 99.819 | 99.794 | 99.715 | 99.861 | 99.205 | 98.979 | 99.759 | 99.901 | 99.97 | ||||||||||||||||||||||||||
EER | 2.161 | 1.066 | 5.234 | 1.542 | 1.17 | 1.309 | 0.704 | 3.906 | 3.019 | 2.374 | 0.699 | 0.535 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
StyleGAN3 | Gender | 0.645 | 1.688 | 1.345 | 0.399 | 0.177 | 0.132 | 0.339 | 1.194 | 1.626 | 1.868 | 0.378 | 0.23 | |||||||||||||||||||||||||
5.585 | 6.032 | 4.595 | 5.195 | 5.327 | 5.274 | 5.532 | 5.417 | 5.837 | 6.14 | 5.536 | 5.482 | |||||||||||||||||||||||||||
0.374 | 0.558 | 1.26 | 0.235 | 0.113 | 0.031 | 0.174 | 0.062 | 0.221 | 0.886 | 0.173 | 0.086 | |||||||||||||||||||||||||||
0.774 | 1.792 | 1.843 | 0.56 | 0.254 | 0.219 | 0.382 | 1.305 | 1.842 | 1.952 | 0.421 | 0.264 | |||||||||||||||||||||||||||
Race | 1.701 | 8.361 | 11.792 | 0.605 | 0.498 | 1.546 | 0.893 | 19.275 | 17.174 | 2.079 | 0.708 | 3.61 | ||||||||||||||||||||||||||
41.75 | 42.909 | 41.384 | 42.642 | 43.108 | 42.823 | 43.681 | 45.108 | 45.534 | 42.603 | 43.073 | 44.332 | |||||||||||||||||||||||||||
0.259 | 1.66 | 2.29 | 0.514 | 0.47 | 0.436 | 0.612 | 2.527 | 2.103 | 0.757 | 0.335 | 0.761 | |||||||||||||||||||||||||||
4.543 | 13.614 | 18.916 | 1.177 | 1.092 | 2.955 | 1.47 | 21.59 | 24.212 | 4.237 | 2.164 | 4.885 | |||||||||||||||||||||||||||
Age | 1.403 | 2.138 | 10.825 | 1.727 | 0.743 | 0.459 | 2.1 | 11.432 | 14.27 | 3.777 | 1.792 | 0.892 | ||||||||||||||||||||||||||
14.913 | 14.783 | 17.612 | 14.285 | 14.734 | 14.734 | 14.04 | 17.735 | 19.446 | 15.206 | 13.967 | 15.198 | |||||||||||||||||||||||||||
0.782 | 0.986 | 4.387 | 1.1 | 0.612 | 0.465 | 1.408 | 3.895 | 5.177 | 1.775 | 0.984 | 0.31 | |||||||||||||||||||||||||||
3.378 | 4.108 | 14.744 | 3.685 | 1.714 | 1.141 | 3.498 | 15.191 | 16.05 | 5.888 | 3.075 | 1.073 | |||||||||||||||||||||||||||
Intersection | 2.439 | 10.814 | 14.81 | 1.096 | 1.429 | 2.381 | 1.429 | 24.377 | 22.722 | 3.681 | 2.439 | 4.138 | ||||||||||||||||||||||||||
50.071 | 51.956 | 50.376 | 50.55 | 51.369 | 51.129 | 52.043 | 53.357 | 55.475 | 51.27 | 51.514 | 52.961 | |||||||||||||||||||||||||||
0.893 | 2.808 | 3.841 | 0.644 | 1.013 | 0.526 | 1.124 | 6.702 | 5.604 | 2.306 | 2.143 | 1.429 | |||||||||||||||||||||||||||
11.913 | 30.289 | 40.686 | 3.519 | 4.723 | 6.395 | 5.967 | 44.341 | 52.555 | 14.502 | 7.765 | 12.542 | |||||||||||||||||||||||||||
- | ACC | 98.696 | 98.009 | 95.771 | 98.645 | 99.364 | 99.548 | 99.374 | 94.703 | 96.12 | 97.444 | 99.199 | 99.672 | |||||||||||||||||||||||||
AUC | 99.86 | 99.613 | 99.263 | 99.863 | 99.923 | 99.906 | 99.941 | 99.04 | 98.621 | 99.749 | 99.929 | 99.996 | ||||||||||||||||||||||||||
AP | 99.906 | 99.568 | 99.302 | 99.906 | 99.951 | 99.9 | 99.961 | 99.172 | 98.577 | 99.814 | 99.956 | 99.996 | ||||||||||||||||||||||||||
EER | 1.373 | 1.733 | 4.142 | 1.351 | 0.675 | 0.45 | 0.72 | 4.66 | 5.088 | 2.139 | 0.653 | 0.36 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
MSG StyleGAN | Gender | 1.136 | 12.078 | 0.703 | 3.409 | 0.654 | 2.219 | 0.654 | 14.301 | 7.359 | 0.614 | 0.039 | 0 | |||||||||||||||||||||||||
27.6 | 31.71 | 25.179 | 26.151 | 27.903 | 26.151 | 27.903 | 15.658 | 29.646 | 27.059 | 27.784 | 28.325 | |||||||||||||||||||||||||||
0.725 | 3.146 | 3.146 | 2.174 | 0.422 | 1.33 | 0.422 | 8.063 | 1.321 | 0.66 | 0.541 | 0 | |||||||||||||||||||||||||||
1.136 | 13.323 | 0.703 | 3.409 | 0.654 | 2.872 | 0.654 | 15.978 | 7.359 | 0.668 | 0.039 | 0 | |||||||||||||||||||||||||||
Race | 0.709 | 12.5 | 18.762 | 9.091 | 0.515 | 17.473 | 0.515 | 18.182 | 13.217 | 2.577 | 12.5 | 0 | ||||||||||||||||||||||||||
49.876 | 46.294 | 36.493 | 50.174 | 49.279 | 32.91 | 49.279 | 41.228 | 43.333 | 48.682 | 48.682 | 49.577 | |||||||||||||||||||||||||||
0.299 | 10.526 | 13.085 | 5.263 | 0.299 | 16.07 | 0.299 | 7.456 | 8.437 | 2.09 | 5.263 | 0 | |||||||||||||||||||||||||||
1.291 | 22.112 | 27.06 | 9.417 | 1.008 | 25.246 | 1.008 | 50.251 | 14.169 | 7.622 | 12.924 | 0 | |||||||||||||||||||||||||||
Age | 2.857 | 37.594 | 5.714 | 2.894 | 0.526 | 16.667 | 0.526 | 33.333 | 12.381 | 3.846 | 15.614 | 0 | ||||||||||||||||||||||||||
47.334 | 49.913 | 48.962 | 47.673 | 49.434 | 50.451 | 49.434 | 42.249 | 51.74 | 48.417 | 51.534 | 49.773 | |||||||||||||||||||||||||||
2.439 | 6.652 | 1.993 | 2.691 | 0.339 | 4.2 | 0.339 | 4.807 | 4.407 | 3.03 | 2.352 | 0 | |||||||||||||||||||||||||||
3.439 | 51.965 | 7.906 | 4.007 | 1.019 | 19.632 | 1.019 | 37.676 | 17.663 | 9.151 | 27.929 | 0 | |||||||||||||||||||||||||||
Intersection | 1.493 | 20 | 41.892 | 11.111 | 0.667 | 25 | 0.667 | 50 | 50 | 2.667 | 20 | 0 | ||||||||||||||||||||||||||
55.853 | 54.067 | 57.778 | 55.853 | 55.407 | 33.631 | 55.407 | 68.889 | 68.889 | 54.514 | 54.514 | 55.853 | |||||||||||||||||||||||||||
0.901 | 14.286 | 17.321 | 7.143 | 0.446 | 22.222 | 0.446 | 20 | 20 | 2.232 | 7.143 | 0 | |||||||||||||||||||||||||||
3.237 | 54.364 | 57.775 | 15.84 | 2.144 | 39.23 | 2.144 | 123.31 | 59.65 | 11.79 | 23.97 | 0 | |||||||||||||||||||||||||||
- | ACC | 99.733 | 95.467 | 95.467 | 99.2 | 99.733 | 98.667 | 99.733 | 86.933 | 96.267 | 98.133 | 98.933 | 100 | |||||||||||||||||||||||||
AUC | 99.997 | 98.943 | 99.834 | 99.994 | 100 | 99.928 | 100 | 94.53 | 96.669 | 99.908 | 100 | 100 | ||||||||||||||||||||||||||
AP | 99.998 | 97.156 | 99.863 | 99.995 | 100 | 99.939 | 100 | 93.162 | 95.249 | 99.922 | 100 | 100 | ||||||||||||||||||||||||||
EER | 0.581 | 4.651 | 2.326 | 0 | 0 | 1.163 | 0 | 11.628 | 6.395 | 1.163 | 0 | 0 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
ProGAN | Gender | 0.381 | 1.296 | 1.134 | 0.428 | 0.315 | 0.403 | 0.243 | 0.834 | 2.333 | 0.236 | 0.242 | 0.138 | |||||||||||||||||||||||||
16.706 | 17.133 | 15.201 | 16.709 | 16.918 | 16.671 | 17.043 | 15.652 | 16.838 | 16.708 | 17.082 | 16.882 | |||||||||||||||||||||||||||
0.341 | 0.006 | 1.793 | 0.396 | 0.305 | 0.187 | 0.131 | 1.442 | 0.227 | 0.236 | 0.081 | 0.19 | |||||||||||||||||||||||||||
0.504 | 1.501 | 1.137 | 0.610 | 0.613 | 0.437 | 0.257 | 0.947 | 2.389 | 0.304 | 0.271 | 0.167 | |||||||||||||||||||||||||||
Race | 3.598 | 5.357 | 9.424 | 2.721 | 3.565 | 4.177 | 2.743 | 18.043 | 20.491 | 3.027 | 3.149 | 0.822 | ||||||||||||||||||||||||||
35.504 | 30.506 | 25.824 | 35.179 | 36.143 | 35.551 | 35.743 | 22.053 | 18.852 | 35.97 | 35.661 | 34.624 | |||||||||||||||||||||||||||
1.285 | 5.514 | 10.036 | 1.232 | 0.609 | 1.279 | 0.693 | 15.502 | 17.141 | 0.48 | 0.926 | 0.844 | |||||||||||||||||||||||||||
5.759 | 9.542 | 14.448 | 4.235 | 4.912 | 5.619 | 3.788 | 22.113 | 25.053 | 4.519 | 4.381 | 1.968 | |||||||||||||||||||||||||||
Age | 1.374 | 1.953 | 8.853 | 0.898 | 0.798 | 0.932 | 0.656 | 6.036 | 5.397 | 1.262 | 0.656 | 1.244 | ||||||||||||||||||||||||||
21.22 | 22.383 | 19.714 | 21.441 | 21.587 | 21.342 | 21.717 | 19.868 | 21.597 | 21.319 | 21.817 | 21.601 | |||||||||||||||||||||||||||
0.897 | 0.912 | 5.411 | 0.804 | 0.583 | 0.726 | 0.503 | 4.49 | 2.83 | 1.067 | 0.367 | 0.702 | |||||||||||||||||||||||||||
2.875 | 3.954 | 11.106 | 2.453 | 2.53 | 2.744 | 1.763 | 10.015 | 6.825 | 3.491 | 1.646 | 1.582 | |||||||||||||||||||||||||||
Intersection | 4.284 | 6.829 | 10.793 | 6.557 | 6.581 | 8.513 | 6.604 | 20.936 | 24.845 | 6.406 | 6.581 | 0.935 | ||||||||||||||||||||||||||
50.437 | 47.389 | 40.919 | 50.462 | 51.762 | 51.533 | 52.347 | 38.606 | 36.209 | 51.858 | 51.988 | 50.523 | |||||||||||||||||||||||||||
1.882 | 5.836 | 11.532 | 2.432 | 1.162 | 1.529 | 0.818 | 16.738 | 19.088 | 0.956 | 0.993 | 0.961 | |||||||||||||||||||||||||||
13.14 | 22.712 | 31.013 | 13.286 | 13.577 | 16.796 | 11.759 | 42.997 | 52.525 | 13.479 | 12.476 | 3.729 | |||||||||||||||||||||||||||
- | ACC | 99.357 | 98.286 | 96.458 | 99.344 | 99.558 | 99.384 | 99.639 | 95.045 | 96.418 | 99.243 | 99.688 | 99.68 | |||||||||||||||||||||||||
AUC | 99.968 | 99.899 | 99.84 | 99.938 | 99.961 | 99.948 | 99.977 | 99.895 | 99.105 | 99.954 | 99.984 | 99.996 | ||||||||||||||||||||||||||
AP | 99.976 | 99.928 | 99.861 | 99.959 | 99.974 | 99.959 | 99.984 | 99.927 | 98.838 | 99.966 | 99.988 | 99.997 | ||||||||||||||||||||||||||
EER | 0.535 | 0.916 | 1.838 | 0.547 | 0.44 | 0.595 | 0.363 | 0.69 | 3.094 | 0.696 | 0.345 | 0.321 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
STGAN | Gender | 17.404 | 18.772 | 4.36 | 13.333 | 12.912 | 5.087 | 6.737 | 12.632 | 7.965 | 12.596 | 14.737 | 5.333 | |||||||||||||||||||||||||
16.581 | 15 | 5.161 | 12.984 | 13.419 | 7.823 | 11.774 | 0.097 | 14.984 | 12.935 | 15.645 | 11.194 | |||||||||||||||||||||||||||
7.968 | 10.419 | 2.452 | 8.532 | 7.387 | 4.661 | 2.452 | 12.774 | 0.048 | 5.903 | 6.323 | 2.581 | |||||||||||||||||||||||||||
17.404 | 21.708 | 5.238 | 17.171 | 15.412 | 9.087 | 7.812 | 19.085 | 15.123 | 13.934 | 15.812 | 5.333 | |||||||||||||||||||||||||||
Race | 13.299 | 20 | 4.167 | 11.765 | 17.647 | 4.412 | 17.647 | 50 | 40 | 26.961 | 10 | 5.882 | ||||||||||||||||||||||||||
25 | 32.197 | 18.561 | 20.613 | 19.048 | 19.697 | 22.811 | 13.258 | 29.337 | 13.62 | 18.215 | 20.613 | |||||||||||||||||||||||||||
8.97 | 4.5 | 2.464 | 8 | 12 | 3.297 | 12 | 36.742 | 10.023 | 19.833 | 7.955 | 4 | |||||||||||||||||||||||||||
18.036 | 47.951 | 6.249 | 27.305 | 24.367 | 17.005 | 33.472 | 83.525 | 53.832 | 35.574 | 17.67 | 8.072 | |||||||||||||||||||||||||||
Age | 18.277 | 28.125 | 22.581 | 20 | 23.333 | 8.696 | 20 | 22.5 | 14.146 | 19.916 | 20.784 | 10 | ||||||||||||||||||||||||||
10.656 | 11.688 | 10.343 | 14.52 | 19.438 | 10.82 | 16.159 | 10.134 | 8.765 | 13.574 | 14.844 | 12.881 | |||||||||||||||||||||||||||
10.99 | 20 | 11.475 | 13.115 | 11.475 | 5.455 | 11.475 | 14.637 | 8.67 | 12.404 | 9.115 | 4.918 | |||||||||||||||||||||||||||
26.439 | 43.956 | 36.21 | 42.739 | 41.293 | 15.487 | 29.446 | 43.298 | 35.301 | 48.478 | 32.922 | 13.125 | |||||||||||||||||||||||||||
Intersection | 28.571 | 30.612 | 16.667 | 16.327 | 23.077 | 7.812 | 25 | 66.667 | 66.667 | 38.462 | 24.49 | 7.692 | ||||||||||||||||||||||||||
40.833 | 45 | 39.167 | 35.177 | 32.78 | 35 | 36.947 | 33.333 | 47.677 | 23.82 | 33.038 | 35 | |||||||||||||||||||||||||||
16.667 | 22.222 | 7.143 | 10.619 | 15.789 | 7.08 | 16.667 | 48.333 | 20.833 | 26.316 | 16.667 | 5.263 | |||||||||||||||||||||||||||
73.153 | 131.743 | 58.478 | 75.809 | 70.579 | 43.757 | 85.001 | 236.087 | 150.768 | 101.162 | 90.191 | 22.28 | |||||||||||||||||||||||||||
- | ACC | 93.521 | 88.451 | 94.93 | 95.775 | 95.775 | 97.465 | 94.93 | 69.577 | 93.521 | 90.423 | 93.239 | 98.873 | |||||||||||||||||||||||||
AUC | 99.573 | 96.465 | 98.534 | 99.541 | 99.547 | 97.807 | 99.538 | 85.921 | 97.335 | 99.194 | 99.522 | 99.908 | ||||||||||||||||||||||||||
AP | 99.639 | 95.872 | 98.139 | 99.59 | 99.607 | 97.132 | 99.579 | 82.016 | 96.354 | 99.242 | 99.554 | 99.922 | ||||||||||||||||||||||||||
EER | 4.217 | 7.831 | 3.614 | 3.614 | 3.614 | 3.614 | 3.614 | 19.277 | 6.024 | 3.614 | 3.614 | 3.614 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
VQGAN | Gender | 0.359 | 1.689 | 0.682 | 0.732 | 0.368 | 0.47 | 0.241 | 1.789 | 1.021 | 0.642 | 0.338 | 0.207 | |||||||||||||||||||||||||
12.221 | 13.117 | 11.324 | 12.105 | 12.491 | 12.466 | 12.581 | 9.224 | 12.203 | 12.216 | 12.609 | 12.755 | |||||||||||||||||||||||||||
0.231 | 0.067 | 1.264 | 0.49 | 0.377 | 0.454 | 0.187 | 0.12 | 0.625 | 0.482 | 0.243 | 0.069 | |||||||||||||||||||||||||||
0.371 | 2.198 | 0.722 | 0.829 | 0.719 | 0.876 | 0.367 | 2.226 | 1.274 | 0.853 | 0.513 | 0.386 | |||||||||||||||||||||||||||
Race | 1.267 | 8.893 | 10.064 | 2.257 | 3.158 | 2.321 | 3.273 | 20.147 | 20.549 | 4.11 | 3.183 | 1.429 | ||||||||||||||||||||||||||
59.881 | 60.677 | 57.134 | 60.756 | 60.89 | 60.225 | 61.269 | 54.127 | 62.9 | 60.342 | 61.099 | 61.113 | |||||||||||||||||||||||||||
0.692 | 1.985 | 3.926 | 0.611 | 0.47 | 0.855 | 0.58 | 9.087 | 2.245 | 0.726 | 0.518 | 0.283 | |||||||||||||||||||||||||||
3.139 | 13.952 | 15.64 | 3.751 | 4.581 | 4.174 | 4.947 | 34.162 | 25.23 | 6.323 | 4.388 | 2.143 | |||||||||||||||||||||||||||
Age | 1.104 | 1.105 | 9.03 | 1.813 | 1.533 | 1.533 | 0.648 | 5.925 | 8.977 | 1.61 | 0.703 | 0.897 | ||||||||||||||||||||||||||
29.715 | 29.264 | 30.54 | 29.734 | 30.239 | 30.338 | 30.415 | 22.037 | 31.068 | 30.059 | 30.343 | 30.537 | |||||||||||||||||||||||||||
0.956 | 0.953 | 2.392 | 0.984 | 0.785 | 0.866 | 0.339 | 4.197 | 3.12 | 0.992 | 0.444 | 0.461 | |||||||||||||||||||||||||||
2.363 | 4.055 | 11.65 | 3.145 | 2.892 | 3.001 | 1.344 | 18.036 | 10.529 | 3.499 | 1.421 | 1.653 | |||||||||||||||||||||||||||
Intersection | 3.846 | 13.893 | 12.515 | 3.44 | 3.504 | 3.112 | 3.671 | 23.965 | 25.638 | 4.721 | 3.525 | 2 | ||||||||||||||||||||||||||
67.678 | 69.217 | 64.056 | 68.414 | 68.785 | 67.971 | 69.072 | 60.076 | 71.221 | 68.725 | 69.149 | 69.359 | |||||||||||||||||||||||||||
1.703 | 3.567 | 5.313 | 0.889 | 1.029 | 1.408 | 0.787 | 10.854 | 3.869 | 2.36 | 1.186 | 0.976 | |||||||||||||||||||||||||||
9.427 | 27.715 | 31.838 | 9.934 | 10.748 | 11.46 | 11.34 | 66.065 | 46.268 | 16.097 | 10.787 | 5.751 | |||||||||||||||||||||||||||
- | ACC | 99.092 | 97.936 | 96.313 | 99.102 | 99.344 | 99.387 | 99.543 | 91.217 | 96.248 | 99.135 | 99.538 | 99.758 | |||||||||||||||||||||||||
AUC | 99.909 | 99.746 | 99.699 | 99.883 | 99.878 | 99.879 | 99.912 | 96.257 | 98.872 | 99.912 | 99.938 | 99.99 | ||||||||||||||||||||||||||
AP | 99.926 | 99.755 | 99.716 | 99.901 | 99.871 | 99.855 | 99.895 | 96.027 | 98.822 | 99.932 | 99.952 | 99.991 | ||||||||||||||||||||||||||
EER | 0.835 | 1.565 | 2.554 | 0.706 | 0.683 | 0.588 | 0.447 | 9.508 | 4.46 | 0.812 | 0.447 | 0.306 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
Commercial Tools | Gender | 7.689 | 8.221 | 4.167 | 4.432 | 2.879 | 0.435 | 3.258 | 8.864 | 2.708 | 2.348 | 4.432 | 2.083 | |||||||||||||||||||||||||
18.977 | 22.798 | 18.901 | 18.527 | 16.616 | 16.540 | 18.714 | 20.701 | 16.540 | 16.990 | 17.440 | 17.177 | |||||||||||||||||||||||||||
5.961 | 1.613 | 3.337 | 3.337 | 2.326 | 1.350 | 2.250 | 5.137 | 3.524 | 2.700 | 4.424 | 2.887 | |||||||||||||||||||||||||||
10.397 | 14.471 | 5.714 | 4.867 | 3.459 | 0.435 | 4.950 | 10.411 | 4.791 | 2.929 | 7.140 | 2.664 | |||||||||||||||||||||||||||
Race | 33.333 | 14.286 | 33.333 | 28.571 | 33.333 | 28.571 | 28.571 | 28.571 | 33.333 | 33.333 | 33.333 | 33.333 | ||||||||||||||||||||||||||
69.398 | 61.706 | 71.572 | 69.398 | 65.050 | 73.746 | 69.398 | 65.050 | 73.746 | 73.746 | 71.572 | 74.089 | |||||||||||||||||||||||||||
21.053 | 7.018 | 10.526 | 10.526 | 15.789 | 4.348 | 8.696 | 13.043 | 10.526 | 15.789 | 15.789 | 5.263 | |||||||||||||||||||||||||||
77.098 | 29.869 | 62.147 | 52.682 | 73.558 | 36.044 | 48.191 | 50.286 | 63.706 | 70.345 | 70.475 | 57.018 | |||||||||||||||||||||||||||
Age | 28.571 | 28.571 | 14.286 | 28.571 | 6.711 | 12.500 | 14.286 | 14.286 | 14.286 | 14.286 | 14.286 | 5.833 | ||||||||||||||||||||||||||
49.346 | 58.824 | 52.778 | 50.817 | 47.876 | 52.288 | 51.797 | 49.346 | 52.288 | 51.307 | 50.327 | 52.288 | |||||||||||||||||||||||||||
9.225 | 10.205 | 8.170 | 8.170 | 6.566 | 11.111 | 8.170 | 6.566 | 9.641 | 7.680 | 7.680 | 8.660 | |||||||||||||||||||||||||||
37.679 | 36.583 | 21.513 | 43.361 | 20.392 | 13.639 | 27.762 | 28.094 | 22.990 | 22.156 | 22.837 | 8.656 | |||||||||||||||||||||||||||
Intersection | 50.000 | 25.000 | 50.000 | 33.333 | 50.000 | 33.333 | 33.333 | 33.333 | 50.000 | 50.000 | 50.000 | 50.000 | ||||||||||||||||||||||||||
65.714 | 62.637 | 68.889 | 68.000 | 68.000 | 72.000 | 68.000 | 65.714 | 72.000 | 72.000 | 72.000 | 74.444 | |||||||||||||||||||||||||||
22.222 | 8.547 | 11.111 | 11.111 | 19.048 | 5.128 | 9.524 | 20.000 | 11.111 | 16.667 | 16.667 | 7.692 | |||||||||||||||||||||||||||
131.495 | 67.938 | 105.685 | 81.793 | 129.065 | 59.940 | 76.252 | 91.701 | 105.535 | 114.665 | 126.561 | 109.221 | |||||||||||||||||||||||||||
- | ACC | 93.976 | 95.582 | 95.582 | 95.582 | 92.771 | 97.590 | 95.984 | 92.369 | 96.787 | 95.181 | 95.181 | 96.386 | |||||||||||||||||||||||||
AUC | 95.778 | 99.541 | 99.005 | 96.349 | 94.798 | 95.716 | 95.681 | 96.808 | 97.371 | 97.141 | 94.812 | 93.365 | ||||||||||||||||||||||||||
AP | 96.193 | 99.751 | 99.401 | 96.966 | 95.607 | 96.184 | 93.761 | 98.000 | 98.153 | 97.779 | 94.066 | 90.493 | ||||||||||||||||||||||||||
EER | 7.692 | 3.297 | 5.495 | 6.593 | 8.791 | 6.593 | 6.593 | 9.890 | 6.593 | 6.593 | 6.593 | 7.692 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
DCFace | Gender | 0.525 | 0.87 | 1.201 | 0.637 | 0.196 | 0.338 | 0.052 | 0.596 | 1.77 | 0.368 | 0.066 | 0.04 | |||||||||||||||||||||||||
5.465 | 4.56 | 5.28 | 5.534 | 5.342 | 5.371 | 5.274 | 4.013 | 4.012 | 5.351 | 5.265 | 5.213 | |||||||||||||||||||||||||||
0.191 | 0.412 | 0.311 | 0.214 | 0.062 | 0.163 | 0.015 | 0.191 | 1.155 | 0.161 | 0.048 | 0.018 | |||||||||||||||||||||||||||
0.549 | 0.873 | 1.393 | 0.704 | 0.231 | 0.343 | 0.066 | 0.944 | 1.86 | 0.415 | 0.085 | 0.076 | |||||||||||||||||||||||||||
Race | 0.667 | 5.55 | 8.151 | 0.608 | 0.737 | 0.591 | 0.663 | 16.794 | 18.582 | 0.78 | 0.669 | 0.938 | ||||||||||||||||||||||||||
18.219 | 18.877 | 20.375 | 18.181 | 18.285 | 18.201 | 18.441 | 26.177 | 25.501 | 18.289 | 18.292 | 18.522 | |||||||||||||||||||||||||||
0.535 | 2.088 | 3.419 | 0.384 | 0.294 | 0.359 | 0.431 | 5.528 | 6.893 | 0.322 | 0.304 | 0.392 | |||||||||||||||||||||||||||
2.333 | 10.815 | 14.05 | 1.889 | 1.94 | 1.649 | 2.154 | 29.148 | 23.112 | 1.838 | 1.661 | 1.454 | |||||||||||||||||||||||||||
Age | 1.448 | 2.989 | 9.273 | 1.071 | 0.567 | 1.055 | 0.425 | 6.914 | 7.59 | 1.253 | 0.706 | 0.594 | ||||||||||||||||||||||||||
12.708 | 13.621 | 9.357 | 12.926 | 13.119 | 12.754 | 13.08 | 10.929 | 8.445 | 12.763 | 13.314 | 12.741 | |||||||||||||||||||||||||||
0.918 | 1.218 | 4.94 | 0.772 | 0.501 | 0.831 | 0.358 | 6.765 | 4.839 | 0.973 | 0.4 | 0.388 | |||||||||||||||||||||||||||
2.782 | 6.577 | 12.53 | 2.419 | 1.552 | 2.202 | 1.108 | 15.928 | 8.071 | 2.468 | 1.232 | 0.924 | |||||||||||||||||||||||||||
Intersection | 1.327 | 7.377 | 9.272 | 1.463 | 0.892 | 0.923 | 0.866 | 19.337 | 22.619 | 0.886 | 0.868 | 1.29 | ||||||||||||||||||||||||||
21.136 | 20.833 | 22.531 | 21.04 | 20.964 | 20.906 | 21.006 | 27.504 | 28.454 | 21.116 | 20.984 | 21.006 | |||||||||||||||||||||||||||
0.764 | 3.089 | 4.017 | 0.649 | 0.362 | 0.518 | 0.498 | 6.588 | 8.981 | 0.548 | 0.391 | 0.568 | |||||||||||||||||||||||||||
5.666 | 22.247 | 27.649 | 6.712 | 4.452 | 4.176 | 4.367 | 58.647 | 45.831 | 5.043 | 3.709 | 2.873 | |||||||||||||||||||||||||||
- | ACC | 99.361 | 96.935 | 96.038 | 99.314 | 99.542 | 99.395 | 99.627 | 92.834 | 96.443 | 99.329 | 99.654 | 99.727 | |||||||||||||||||||||||||
AUC | 99.961 | 99.513 | 99.718 | 99.938 | 99.934 | 99.956 | 99.965 | 97.415 | 99.129 | 99.956 | 99.965 | 99.994 | ||||||||||||||||||||||||||
AP | 99.972 | 99.602 | 99.776 | 99.955 | 99.947 | 99.963 | 99.97 | 97.347 | 98.913 | 99.969 | 99.977 | 99.995 | ||||||||||||||||||||||||||
EER | 0.422 | 3.07 | 2.649 | 0.414 | 0.414 | 0.612 | 0.363 | 7.661 | 3.26 | 0.515 | 0.368 | 0.322 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
Latent Diffusion | Gender | 0.711 | 1.315 | 6.822 | 0.587 | 0.269 | 0.404 | 0.343 | 0.523 | 1.341 | 0.709 | 0.343 | 0.005 | |||||||||||||||||||||||||
6.704 | 7.776 | 1.172 | 6.844 | 7.154 | 7.227 | 7.274 | 6.063 | 7.26 | 6.444 | 7.367 | 7.15 | |||||||||||||||||||||||||||
0.339 | 0.192 | 3.448 | 0.376 | 0.232 | 0.315 | 0.206 | 0.052 | 0.059 | 0.069 | 0.113 | 0.02 | |||||||||||||||||||||||||||
0.723 | 1.748 | 8.477 | 0.706 | 0.467 | 0.658 | 0.411 | 0.993 | 1.39 | 1.186 | 0.464 | 0.005 | |||||||||||||||||||||||||||
Race | 1.377 | 7.837 | 10.67 | 1.602 | 0.319 | 0.559 | 1.116 | 19.547 | 20.291 | 0.763 | 0.633 | 0.503 | ||||||||||||||||||||||||||
39.89 | 41.058 | 30.225 | 39.89 | 40.64 | 40.819 | 40.462 | 39.759 | 42.918 | 40.116 | 40.95 | 40.593 | |||||||||||||||||||||||||||
1.202 | 1.262 | 9.842 | 0.921 | 0.206 | 0.387 | 0.658 | 4.85 | 2.278 | 0.691 | 0.299 | 0.31 | |||||||||||||||||||||||||||
2.443 | 10.604 | 26.477 | 3.081 | 0.884 | 1.541 | 2.659 | 30.757 | 21.909 | 2.097 | 1.363 | 0.765 | |||||||||||||||||||||||||||
Age | 2.771 | 2.325 | 12.515 | 1.798 | 0.9 | 0.803 | 0.762 | 3.544 | 4.117 | 1.896 | 0.571 | 1.604 | ||||||||||||||||||||||||||
20.503 | 20.83 | 20.755 | 20.183 | 20.119 | 19.742 | 19.955 | 15.919 | 20.598 | 20.183 | 20.062 | 20.823 | |||||||||||||||||||||||||||
0.913 | 0.495 | 5.504 | 0.508 | 0.434 | 0.505 | 0.275 | 2.119 | 0.437 | 1.088 | 0.319 | 0.518 | |||||||||||||||||||||||||||
4.881 | 4.303 | 35.584 | 3.991 | 1.97 | 1.782 | 1.411 | 11.437 | 6.184 | 4.923 | 1.425 | 2.165 | |||||||||||||||||||||||||||
Intersection | 3.571 | 10.805 | 13.52 | 3.846 | 1.786 | 1.786 | 1.786 | 22.411 | 24.881 | 3.226 | 1.786 | 0.714 | ||||||||||||||||||||||||||
52.751 | 53.892 | 39.806 | 52.751 | 53.771 | 54.281 | 53.441 | 50.799 | 55.935 | 52.87 | 54.461 | 53.621 | |||||||||||||||||||||||||||
3.061 | 2.274 | 13.761 | 2.643 | 1.322 | 1.322 | 1.531 | 5.062 | 2.708 | 1.442 | 0.541 | 0.51 | |||||||||||||||||||||||||||
10.619 | 23.761 | 61.046 | 11.037 | 7.073 | 5.892 | 8.286 | 57.444 | 45.514 | 11.596 | 5.021 | 1.768 | |||||||||||||||||||||||||||
- | ACC | 99.066 | 98.528 | 88.706 | 99.179 | 99.505 | 99.674 | 99.646 | 92.669 | 96.519 | 98.981 | 99.689 | 99.887 | |||||||||||||||||||||||||
AUC | 99.921 | 99.948 | 96.795 | 99.908 | 99.942 | 99.968 | 99.972 | 97.153 | 98.926 | 99.916 | 99.971 | 99.999 | ||||||||||||||||||||||||||
AP | 99.945 | 99.961 | 96.469 | 99.931 | 99.965 | 99.976 | 99.983 | 96.901 | 98.668 | 99.94 | 99.981 | 99.999 | ||||||||||||||||||||||||||
EER | 0.906 | 0.531 | 9.031 | 0.688 | 0.5 | 0.406 | 0.375 | 8 | 3.719 | 1 | 0.469 | 0.156 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
Palette | Gender | 1.164 | 0.544 | 1.2 | 1.164 | 1.121 | 2.159 | 1.611 | 9.265 | 0.757 | 1.348 | 0.503 | 0.727 | |||||||||||||||||||||||||
13.196 | 13.889 | 12.155 | 13.052 | 13.571 | 13.763 | 13.705 | 2.952 | 13.466 | 12.548 | 13.542 | 13.928 | |||||||||||||||||||||||||||
1.164 | 0.278 | 2.166 | 1.02 | 0.848 | 1.308 | 1.077 | 9.412 | 0.95 | 0.803 | 0.55 | 0.147 | |||||||||||||||||||||||||||
1.999 | 0.866 | 1.844 | 1.45 | 1.814 | 3.043 | 2.412 | 10.763 | 1.438 | 1.536 | 0.884 | 0.791 | |||||||||||||||||||||||||||
Race | 3.659 | 5.947 | 11.668 | 3.333 | 6.098 | 7.317 | 6.098 | 20.528 | 20.242 | 5.108 | 2.83 | 7.317 | ||||||||||||||||||||||||||
5.979 | 6.385 | 4.834 | 7.61 | 5.135 | 4.144 | 5.922 | 16.802 | 8.742 | 5.776 | 7.261 | 5.922 | |||||||||||||||||||||||||||
1.965 | 4.123 | 8.436 | 2.97 | 2.963 | 2.046 | 2.329 | 6.372 | 13.877 | 2.402 | 2.062 | 3.319 | |||||||||||||||||||||||||||
8.157 | 11.825 | 18.769 | 7.416 | 9.477 | 14.029 | 10.363 | 43.621 | 26.601 | 12.368 | 6.161 | 10.827 | |||||||||||||||||||||||||||
Age | 4.688 | 3.002 | 7.966 | 3.756 | 4.042 | 3.765 | 4.425 | 12.333 | 8.923 | 4.995 | 4.042 | 1.948 | ||||||||||||||||||||||||||
19.14 | 19.394 | 18.86 | 20.025 | 18.688 | 19.789 | 20.438 | 14.893 | 21.909 | 20.674 | 20.674 | 17.134 | |||||||||||||||||||||||||||
3.534 | 1.426 | 4.149 | 2.78 | 3.775 | 3.392 | 2.765 | 10.627 | 4.861 | 2.715 | 2.271 | 0.865 | |||||||||||||||||||||||||||
11.998 | 6.893 | 14.031 | 9.808 | 11.945 | 12.703 | 12.283 | 38.971 | 10.877 | 11.605 | 9.98 | 3.288 | |||||||||||||||||||||||||||
Intersection | 9.375 | 8.995 | 20.066 | 5.556 | 12.5 | 12.5 | 12.5 | 58 | 22.231 | 12.5 | 4.412 | 9.375 | ||||||||||||||||||||||||||
37.067 | 39.108 | 27.455 | 36.047 | 37.067 | 37.352 | 38.372 | 23.106 | 27.997 | 36.047 | 39.393 | 35.026 | |||||||||||||||||||||||||||
5.769 | 5.454 | 18.902 | 4.808 | 4.808 | 4.808 | 4.808 | 22.672 | 17.864 | 4.137 | 3.54 | 4.082 | |||||||||||||||||||||||||||
24.367 | 23.905 | 43.338 | 18.623 | 24.071 | 33.455 | 25.185 | 150.941 | 53.279 | 30.359 | 16.799 | 27.109 | |||||||||||||||||||||||||||
- | ACC | 98.547 | 97.465 | 94.189 | 98.671 | 98.578 | 98.423 | 98.702 | 73.447 | 94.405 | 97.682 | 98.887 | 99.073 | |||||||||||||||||||||||||
AUC | 99.736 | 99.581 | 99.501 | 99.756 | 99.644 | 99.387 | 99.856 | 80.642 | 97.922 | 99.704 | 99.781 | 99.923 | ||||||||||||||||||||||||||
AP | 99.423 | 98.911 | 99.063 | 99.497 | 99.079 | 98.07 | 99.725 | 67.995 | 95.558 | 99.432 | 99.657 | 99.867 | ||||||||||||||||||||||||||
EER | 1.525 | 1.672 | 2.951 | 1.279 | 1.426 | 1.574 | 1.328 | 26.365 | 6.05 | 2.361 | 1.279 | 1.082 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
SD1.5 | Gender | 0.509 | 3.139 | 0.317 | 1.139 | 1.186 | 0.248 | 1.674 | 2.635 | 0.891 | 0.457 | 1.275 | 0.345 | |||||||||||||||||||||||||
22.081 | 19.548 | 20.978 | 22.834 | 22.776 | 22.724 | 22.857 | 11.878 | 20.185 | 20.993 | 23.084 | 22.934 | |||||||||||||||||||||||||||
0.495 | 1.293 | 1.576 | 0.178 | 0.153 | 0.021 | 0.332 | 4.404 | 2.235 | 1.599 | 0.07 | 0.119 | |||||||||||||||||||||||||||
0.613 | 3.185 | 0.554 | 1.705 | 1.56 | 0.359 | 1.886 | 3.738 | 1.065 | 0.625 | 1.661 | 0.399 | |||||||||||||||||||||||||||
Race | 1.968 | 7.1 | 8.289 | 2.12 | 1.5 | 1.342 | 1.959 | 19.503 | 18.674 | 2.837 | 1.469 | 1.323 | ||||||||||||||||||||||||||
15.903 | 16.759 | 15.365 | 15.889 | 15.045 | 14.817 | 15.624 | 11.712 | 14.571 | 14.654 | 15.689 | 14.94 | |||||||||||||||||||||||||||
0.943 | 2.918 | 4.108 | 1.294 | 0.879 | 0.297 | 1.454 | 10.28 | 9.562 | 1.645 | 1.347 | 0.901 | |||||||||||||||||||||||||||
4.614 | 15.815 | 13.016 | 4.227 | 2.198 | 2.227 | 3.486 | 40.819 | 27.388 | 6.944 | 3.342 | 3.124 | |||||||||||||||||||||||||||
Age | 2.87 | 3.457 | 5.482 | 1.658 | 3.43 | 1.295 | 3.244 | 5.749 | 11.164 | 4.406 | 2.534 | 1.061 | ||||||||||||||||||||||||||
31.026 | 28.27 | 31.043 | 30.768 | 32.076 | 31.734 | 31.627 | 16.203 | 30.92 | 31.262 | 31.78 | 32.059 | |||||||||||||||||||||||||||
2.054 | 2.83 | 1.832 | 1.244 | 2.275 | 0.828 | 2.1 | 11.13 | 2.965 | 2.806 | 1.818 | 0.942 | |||||||||||||||||||||||||||
6.24 | 9.127 | 9.167 | 3.913 | 6.148 | 2.276 | 5.276 | 12.176 | 15.841 | 7.182 | 5.215 | 2.787 | |||||||||||||||||||||||||||
Intersection | 3.333 | 11.68 | 11.018 | 3.333 | 6.667 | 2.439 | 6.206 | 24.497 | 23.778 | 3.283 | 6.206 | 1.695 | ||||||||||||||||||||||||||
34.936 | 32.557 | 32.823 | 35.564 | 34.066 | 33.963 | 34.985 | 24.536 | 30.941 | 32.333 | 35.227 | 34.35 | |||||||||||||||||||||||||||
1.928 | 5.084 | 4.661 | 1.915 | 3.382 | 1.667 | 2.27 | 14.398 | 11.932 | 3.428 | 2.27 | 1.208 | |||||||||||||||||||||||||||
10.679 | 35.812 | 28.777 | 12.75 | 18.354 | 10.377 | 15.07 | 88.474 | 69.692 | 15.9 | 13.668 | 8.346 | |||||||||||||||||||||||||||
- | ACC | 97.272 | 95.847 | 95.862 | 97.833 | 97.848 | 99.045 | 98.151 | 73.219 | 94.983 | 95.696 | 98.424 | 99.47 | |||||||||||||||||||||||||
AUC | 99.792 | 98.953 | 99.499 | 99.803 | 99.826 | 99.766 | 99.877 | 86.563 | 97.922 | 99.63 | 99.893 | 99.963 | ||||||||||||||||||||||||||
AP | 99.832 | 98.661 | 99.538 | 99.828 | 99.862 | 99.716 | 99.914 | 85.449 | 97.861 | 99.675 | 99.928 | 99.969 | ||||||||||||||||||||||||||
EER | 1.887 | 4.073 | 2.947 | 1.755 | 1.523 | 0.993 | 1.192 | 21.159 | 6.424 | 2.682 | 0.861 | 0.53 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset | Attribute | Metric
SD Inpainting | Gender | 2.241 | 6.934 | 2.814 | 2.686 | 1.636 | 2.432 | 1.652 | 4.449 | 1.288 | 2.495 | 1.455 | 0.849 | |||||||||||||||||||||||||
20.739 | 14.685 | 21.614 | 23.834 | 21.747 | 23.587 | 23.36 | 12.632 | 19.574 | 19.711 | 22.764 | 22.798 | |||||||||||||||||||||||||||
2.701 | 7.184 | 0.172 | 1.041 | 2.434 | 1.154 | 0.901 | 6.926 | 1.511 | 3.887 | 1.367 | 0.746 | |||||||||||||||||||||||||||
2.704 | 6.971 | 3.64 | 3.722 | 3.096 | 2.449 | 2.302 | 5.737 | 2.278 | 2.932 | 1.837 | 1.048 | |||||||||||||||||||||||||||
Race | 2.353 | 8.159 | 13.709 | 4.767 | 2.884 | 3.957 | 4.424 | 15.135 | 23.357 | 4.628 | 4.424 | 1.599 | ||||||||||||||||||||||||||
11.33 | 9.279 | 10.32 | 11.773 | 10.267 | 12.654 | 11.567 | 6.414 | 10.94 | 12.332 | 13.285 | 12.717 | |||||||||||||||||||||||||||
2.569 | 6.514 | 5.566 | 1.598 | 2.857 | 1.908 | 2.863 | 7.86 | 9.299 | 2.983 | 2.904 | 1.378 | |||||||||||||||||||||||||||
6.652 | 21.735 | 26.462 | 7.096 | 7.51 | 8.601 | 8.099 | 34.693 | 34.503 | 9.538 | 11.207 | 3.61 | |||||||||||||||||||||||||||
Age | 6.106 | 10.945 | 8.907 | 8.131 | 6.14 | 6.319 | 6.518 | 5.833 | 5.738 | 6.494 | 6.329 | 1.77 | ||||||||||||||||||||||||||
34.829 | 24.512 | 35.588 | 35.072 | 36.422 | 36.373 | 35.589 | 22.872 | 33.578 | 35.295 | 35.397 | 36.172 | |||||||||||||||||||||||||||
5.83 | 11.618 | 4.664 | 3.031 | 4.865 | 2.925 | 3.538 | 10.413 | 2.475 | 6.172 | 3.678 | 1.276 | |||||||||||||||||||||||||||
11.823 | 19.295 | 20.31 | 11.192 | 12.313 | 8.877 | 10.26 | 16.169 | 10.115 | 11.654 | 10.325 | 4.012 | |||||||||||||||||||||||||||
Intersection | 6.237 | 19.863 | 19.037 | 8.725 | 5.369 | 6.711 | 7.383 | 19.06 | 31.294 | 12.213 | 7.383 | 3.693 | ||||||||||||||||||||||||||
27.04 | 21.323 | 28.54 | 30.477 | 27.278 | 30.434 | 29.669 | 17.974 | 28.629 | 25.716 | 30.228 | 30.269 | |||||||||||||||||||||||||||
4.884 | 13.648 | 6.769 | 4.329 | 5.395 | 3.493 | 4.397 | 14.884 | 12.987 | 9.387 | 6.247 | 3.194 | |||||||||||||||||||||||||||
18.736 | 61.309 | 52.861 | 25.231 | 20.741 | 27.99 | 21.834 | 69.995 | 61.174 | 28.951 | 22.144 | 11.665 | |||||||||||||||||||||||||||
- | ACC | 95.133 | 86.754 | 94.333 | 96.517 | 95.475 | 97.445 | 96.86 | 78.49 | 94.105 | 92.849 | 96.846 | 98.715 | |||||||||||||||||||||||||
AUC | 99.552 | 97.281 | 98.31 | 99.525 | 99.547 | 99.659 | 99.687 | 89.51 | 97.403 | 99.386 | 99.707 | 99.912 | ||||||||||||||||||||||||||
AP | 99.679 | 97.138 | 98.529 | 99.652 | 99.677 | 99.727 | 99.766 | 91.313 | 97.767 | 99.564 | 99.79 | 99.939 | ||||||||||||||||||||||||||
EER | 3.434 | 7.631 | 6.729 | 3.226 | 3.33 | 2.463 | 2.428 | 17.933 | 7.527 | 3.954 | 2.393 | 1.283 |
B.5 Details of Post-Processing
In Section 4, we applied six post-processing methods to evaluate the detectors' robustness. Fig. B.1 visualizes an image after each post-processing method has been applied. We describe each method as follows:
JPEG Compression: Introduces compression artifacts and reduces image quality, simulating real-world scenarios in which images are stored or shared at lower quality. In Fig. 6, we apply JPEG compression with quality 60 to each image in the test set.
Gaussian Blur: Reduces image detail and noise by smoothing the image with a Gaussian kernel, that is, by replacing each pixel with a Gaussian-weighted average of its neighbors. In Fig. 6, we apply Gaussian blur with kernel size 7 to each image in the test set.
Hue Saturation Value: Alters the hue, saturation, and value of the image within specified limits to simulate variations in color and lighting conditions. Adjusting the hue changes the overall color tone, saturation controls the intensity of colors, and value adjusts the brightness. The results in Fig. 6 are obtained by shifting hue, saturation, and value with a shift limit of 30.
Random Brightness and Contrast: Adjusts the brightness and contrast of the image by random amounts within specified limits, which changes the illumination and contrast levels of the images. This evaluates the detector's robustness to different illumination conditions. The results in Fig. 6 are obtained by adjusting brightness and contrast with a limit of 0.2.
Random Crop: Resizes the image to a specified size and then randomly crops a portion of it to the target dimensions. This post-processing method is used to evaluate the detector’s robustness to variations in the spatial content of the image. The results in Fig. 6 are after we randomly crop the image with target dimension of .
Rotation: Rotates the image within a specified angle limit. This post-processing method is used to evaluate the detector’s robustness to changes in the orientation of objects within the image. The results in Fig. 6 are after we randomly rotate the image within a range of -45 to 45 degrees.
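As an illustration of two of the operations described above, the following is a minimal pure-Python sketch of a Gaussian blur with kernel size 7 and a random brightness and contrast adjustment with limit 0.2. It is not the implementation used for the benchmark: the function names are hypothetical, the image is assumed to be a 2-D grayscale list of pixel values in [0, 255], the default sigma is assumed to follow the common OpenCV-style rule derived from the kernel size, and the brightness shift is assumed to be a fraction of the 255 range.

```python
import math
import random


def gaussian_kernel_1d(size, sigma):
    """Return a normalized 1-D Gaussian kernel with `size` taps (size is odd)."""
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    half = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - half, 0), n - 1)
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, size=7, sigma=None):
    """Separable Gaussian blur: filter the rows, then the columns."""
    if sigma is None:
        # Assumption: OpenCV-style default sigma computed from the kernel size.
        sigma = 0.3 * ((size - 1) * 0.5 - 1) + 0.8
    kernel = gaussian_kernel_1d(size, sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


def random_brightness_contrast(image, limit=0.2, rng=random):
    """Scale contrast by (1 + a) and shift brightness by b * 255,
    with a and b drawn uniformly from [-limit, limit]; clip to [0, 255]."""
    alpha = 1.0 + rng.uniform(-limit, limit)
    beta = rng.uniform(-limit, limit) * 255.0
    return [[min(max(alpha * p + beta, 0.0), 255.0) for p in row]
            for row in image]


# Usage: blur a single bright pixel and perturb brightness/contrast.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 255.0
blurred = gaussian_blur(image, size=7)
adjusted = random_brightness_contrast(image, limit=0.2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the perturbation reproducible, which is convenient when the same degraded test set has to be shared across detectors.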
B.6 Additional Fairness Robustness Evaluation Results
Figs. B.2 to B.6 show the detectors' robustness in more detail as a function of the degree of post-processing. Overall, ViT-B/16 [63] and UnivFD [67] are more robust to the various post-processing methods than the other detectors. Fairness-enhanced detectors are not robust to post-processing; addressing this is a direction for future studies. Figure B.2 presents a detailed robustness analysis in terms of utility and fairness under varying degrees of JPEG compression. The utility of all detectors decreases as image quality is reduced. Among the detectors, UnivFD [67] exhibits the highest utility robustness, while ViT-B/16 [63] demonstrates the strongest fairness robustness. Under Gaussian blur, ViT-B/16 is the most robust detector in terms of utility, whereas EfficientB4 [62] is the most robust in terms of fairness. Under Hue Saturation Value adjustments, DAW-FDD [20] shows the strongest utility robustness, while UnivFD excels in fairness robustness. ViT-B/16 demonstrates superior robustness in both utility and fairness under rotation. For brightness and contrast variations, DAG-FDD [20] is the most robust detector in terms of utility, while UnivFD once again shows superior robustness in terms of fairness.
B.7 Additional Fairness Generalization Evaluation Results
We conduct additional generalization experiments in which models trained on FF++ [2] are evaluated on our AI-Face test set. For these experiments, we use the trained weights and intra-domain performance metrics provided by [16]; consequently, only the detectors whose pre-trained weights are available from [16] are evaluated. Results are shown in Table 6. We report the detailed performance on the generation category subsets (i.e., Deepfake Videos, GANs, and DMs) and the overall performance on the whole test set. Detectors trained on FF++ degrade significantly when tested on our AI-Face test set, approaching coin-toss performance. This suggests that training detectors solely on one deepfake video dataset is not sufficient for detecting face images produced by more advanced, current generation models. It also highlights the significance of our AI-Face dataset, which is extensive, diverse, and comprehensive in its generation methods, for developing and evaluating AI face detectors. The lowest performance is observed on GANs, likely due to the higher variety of generation methods within this category. Conversely, performance on the Deepfake Videos subset is relatively better. This could be because, despite coming from different datasets, the deepfake videos may share similar generation methods, resulting in less variation in the artifacts present in the generated images.
Type | Detector | Intra-Domain (FF++) | Cross-Domain (Ours w/o FF++) Test Subset: Deepfake Videos | | Cross-Domain (Ours w/o FF++) Test Subset: GANs (10) | | Cross-Domain (Ours w/o FF++) Test Subset: DMs (8) | | Cross-Domain (Ours w/o FF++) Whole Test Set |
| | AUC | | AUC | | AUC | | AUC | | AUC
Naive | Xception [61] | 96.370 | 104.961 | 77.766 | 139.963 | 58.228 | 110.977 | 78.622 | 101.194 | 72.649
| EfficientB4 [62] | 95.670 | 110.626 | 76.612 | 148.656 | 44.501 | 88.420 | 73.426 | 94.609 | 65.323
Frequency | F3Net [64] | 96.350 | 74.828 | 74.328 | 93.278 | 39.127 | 89.927 | 75.480 | 68.299 | 65.149
| SPSL [65] | 96.100 | 97.558 | 77.766 | 141.029 | 40.100 | 91.837 | 58.919 | 123.534 | 55.483
| SRM [66] | 95.760 | 60.855 | 74.900 | 89.903 | 57.572 | 73.209 | 77.954 | 57.775 | 72.474
Spatial | UCF [16] | 97.050 | 102.798 | 77.650 | 122.485 | 40.477 | 95.657 | 77.568 | 79.479 | 67.708
| CORE [68] | 96.380 | 69.717 | 76.506 | 95.727 | 45.549 | 79.161 | 82.112 | 72.424 | 70.662
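To make the "coin-toss" reading of the AUC columns concrete, here is a minimal sketch of how an AUC value can be computed from detector scores. It uses the standard pairwise (Mann-Whitney) formulation and reports the result in percent, as in the tables. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the benchmark code.

```python
def roc_auc(labels, scores):
    """AUC in percent: the probability that a randomly chosen fake (label 1)
    receives a higher score than a randomly chosen real (label 0).
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return 100.0 * wins / (len(pos) * len(neg))


# A detector that ranks every fake above every real.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 100.0

# A detector whose scores carry no information: coin-toss level.
print(roc_auc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]))  # 50.0
```

An AUC of 50 therefore means that the scores do not separate fake from real faces at all, and values below 50 mean that the ranking is worse than chance on that subset.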
B.8 Full Results of Effect of Increasing the Size of Train Set
In this section, we provide the full evaluation results obtained with different training set sizes, shown in Tables 37 to 40. The Intersection and AUC results match those in Fig. 7 of the submitted manuscript.
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset Size | Attribute | Metric
20% | Gender | 1.725 | 0.366 | 0.863 | 1.523 | 0.916 | 1.818 | 0.652 | 0.369 | 2.196 | 1.657 | 1.549 | 1.428 | |||||||||||||||||||||||||
1.944 | 2.239 | 2.618 | 2.083 | 2.305 | 1.586 | 2.811 | 2.317 | 1.543 | 1.823 | 2.269 | 2.106 | |||||||||||||||||||||||||||
1.076 | 0.419 | 0.906 | 1.057 | 0.775 | 0.800 | 0.617 | 0.620 | 1.076 | 0.950 | 1.145 | 0.904 | |||||||||||||||||||||||||||
2.030 | 0.386 | 1.635 | 1.945 | 1.280 | 2.081 | 0.768 | 0.629 | 2.214 | 1.738 | 2.244 | 1.742 | |||||||||||||||||||||||||||
Race | 14.155 | 11.039 | 10.108 | 13.887 | 12.235 | 15.231 | 11.756 | 14.625 | 16.804 | 16.116 | 12.021 | 12.645 | ||||||||||||||||||||||||||
23.488 | 20.018 | 22.360 | 23.285 | 22.782 | 22.998 | 22.994 | 22.628 | 25.752 | 23.457 | 23.093 | 22.572 | |||||||||||||||||||||||||||
5.266 | 5.286 | 5.057 | 5.416 | 4.807 | 5.425 | 5.063 | 6.459 | 5.913 | 5.009 | 4.877 | 4.676 | |||||||||||||||||||||||||||
24.015 | 19.947 | 25.662 | 25.293 | 22.940 | 23.207 | 24.837 | 28.765 | 29.623 | 22.162 | 22.625 | 21.318 | |||||||||||||||||||||||||||
Age | 6.766 | 5.613 | 5.335 | 7.254 | 5.765 | 6.506 | 8.761 | 5.411 | 7.208 | 5.948 | 5.672 | 5.769 | ||||||||||||||||||||||||||
5.086 | 5.581 | 6.666 | 5.089 | 5.561 | 4.659 | 6.170 | 6.073 | 4.556 | 5.080 | 5.291 | 5.337 | |||||||||||||||||||||||||||
3.784 | 3.177 | 4.958 | 3.745 | 3.493 | 3.435 | 4.491 | 4.692 | 4.209 | 4.183 | 3.159 | 3.242 | |||||||||||||||||||||||||||
9.533 | 9.157 | 12.476 | 9.632 | 9.222 | 9.203 | 11.928 | 14.228 | 10.548 | 9.699 | 8.470 | 8.339 | |||||||||||||||||||||||||||
Intersection | 17.912 | 12.056 | 14.781 | 17.613 | 14.966 | 19.221 | 14.360 | 17.533 | 20.977 | 19.466 | 15.288 | 15.734 | ||||||||||||||||||||||||||
25.299 | 22.237 | 23.053 | 25.005 | 23.895 | 25.807 | 23.863 | 23.563 | 27.720 | 25.542 | 24.273 | 24.374 | |||||||||||||||||||||||||||
8.001 | 9.313 | 8.647 | 7.898 | 7.506 | 6.137 | 8.856 | 11.806 | 8.713 | 5.859 | 7.538 | 5.378 | |||||||||||||||||||||||||||
54.208 | 45.790 | 54.752 | 56.299 | 50.295 | 52.526 | 55.119 | 66.894 | 63.986 | 44.272 | 49.137 | 45.127 | |||||||||||||||||||||||||||
- | ACC | 95.175 | 94.292 | 93.972 | 94.913 | 95.084 | 95.534 | 95.249 | 90.810 | 94.835 | 94.996 | 95.243 | 95.602 | |||||||||||||||||||||||||
AUC | 98.620 | 99.055 | 98.765 | 98.284 | 98.851 | 98.026 | 98.728 | 96.404 | 98.237 | 98.403 | 98.731 | 98.533 | ||||||||||||||||||||||||||
AP | 98.805 | 99.325 | 99.132 | 98.441 | 99.083 | 98.353 | 98.931 | 97.227 | 98.410 | 98.578 | 98.980 | 98.695 | ||||||||||||||||||||||||||
EER | 5.563 | 5.208 | 6.267 | 6.142 | 5.292 | 5.489 | 4.933 | 10.001 | 6.169 | 5.696 | 5.424 | 5.148 |
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset Size | Attribute | Metric
40% | Gender | 1.771 | 2.562 | 1.588 | 1.383 | 1.277 | 0.567 | 1.191 | 0.752 | 1.465 | 1.034 | 1.303 | 1.41
1.801 | 3.841 | 1.948 | 2.33 | 2.088 | 1.955 | 2.715 | 2.756 | 2.113 | 2.362 | 2.128 | 2.998
0.908 | 0.439 | 0.881 | 1.117 | 0.661 | 0.078 | 1.281 | 0.625 | 0.971 | 0.811 | 0.793 | 1.193
1.799 | 3.236 | 1.809 | 2.034 | 1.415 | 1.095 | 2.286 | 1.023 | 1.796 | 1.474 | 1.51 | 2.263
Race | 14.7 | 12.688 | 9.731 | 14.333 | 10.203 | 7.959 | 14.511 | 14.169 | 14.04 | 11.7 | 12.504 | 7.79
23.675 | 21.948 | 22.57 | 23.424 | 22.165 | 21.403 | 25.546 | 23.264 | 22.811 | 21.994 | 22.856 | 21.024
5.079 | 6.318 | 4.774 | 4.986 | 4.49 | 3.571 | 5.708 | 6.282 | 5.222 | 3.819 | 4.448 | 4.043
23.443 | 17.727 | 22.852 | 22.917 | 20.787 | 17.734 | 30.682 | 30.21 | 22.288 | 17.342 | 20.633 | 18.703
Age | 7.594 | 3.343 | 4.055 | 6.859 | 5.051 | 4.126 | 6.145 | 5.676 | 6.85 | 5.874 | 6.46 | 3.48
4.951 | 6.723 | 5.222 | 5.421 | 5.485 | 4.937 | 5.55 | 6.471 | 5.272 | 5.672 | 5.447 | 6.276
3.873 | 1.951 | 2.928 | 3.709 | 3.057 | 2.589 | 3.747 | 4.713 | 3.447 | 3.655 | 3.689 | 3.158
9.596 | 9.58 | 8.258 | 8.736 | 8.461 | 8.457 | 9.256 | 14.995 | 8.374 | 9.236 | 9.119 | 8.222
Intersection | 18.307 | 20.275 | 14.911 | 17.454 | 12.641 | 12.131 | 19.386 | 17.922 | 17.346 | 13.83 | 15.213 | 11.211
25.685 | 24.437 | 23.109 | 24.801 | 23.725 | 22.091 | 26.683 | 23.444 | 24.706 | 23.145 | 24.355 | 21.662
5.814 | 10.624 | 8.96 | 5.964 | 5.707 | 6.691 | 10.527 | 11.615 | 5.902 | 4.63 | 5.04 | 6.579
48.63 | 41.478 | 49.564 | 47.936 | 44.562 | 37.57 | 63.173 | 66.301 | 49.171 | 34.392 | 43.13 | 40.847
- | ACC | 95.796 | 94.03 | 94.822 | 95.844 | 95.794 | 95.393 | 94.754 | 90.711 | 95.984 | 95.975 | 96.257 | 95.337
AUC | 98.696 | 98.932 | 99.024 | 98.851 | 99.064 | 98.722 | 98.306 | 96.371 | 98.824 | 98.974 | 98.949 | 99.092
AP | 98.778 | 99.269 | 99.31 | 98.959 | 99.236 | 98.968 | 98.588 | 97.224 | 98.984 | 99.139 | 99.035 | 99.318
EER | 5.027 | 5.442 | 5.474 | 5.002 | 4.574 | 4.77 | 6.037 | 10.044 | 5.009 | 4.567 | 4.285 | 4.729
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset Size | Attribute | Metric
60% | Gender | 0.801 | 0.489 | 2.239 | 1.576 | 1.787 | 1.179 | 0.745 | 0.539 | 0.907 | 1.512 | 1.527 | 0.408
2.899 | 2.747 | 4.223 | 1.996 | 1.638 | 3.062 | 2.596 | 2.509 | 2.291 | 3.01 | 1.88 | 2.857
0.692 | 0.109 | 0.935 | 0.904 | 0.802 | 0.361 | 0.783 | 0.634 | 0.697 | 1.296 | 0.716 | 0.27
1.136 | 0.677 | 3.474 | 1.757 | 2.002 | 1.22 | 1.328 | 0.586 | 1.153 | 2.435 | 1.594 | 0.547
Race | 8.652 | 16.885 | 6.328 | 13.433 | 16.19 | 6.243 | 9.96 | 14.482 | 14.243 | 9.849 | 14.223 | 5.96
21.794 | 26.205 | 14.609 | 23.519 | 21.498 | 18.874 | 20.947 | 23.469 | 24.031 | 20.453 | 23.547 | 15.86
3.781 | 6.63 | 5.65 | 4.671 | 5.716 | 4.346 | 4.133 | 6.328 | 4.569 | 3.77 | 5.247 | 3.746
18.942 | 23.67 | 21.707 | 22.128 | 23.478 | 12.107 | 15.885 | 29.99 | 22.562 | 14.96 | 22.213 | 14.735
Age | 5.047 | 5.153 | 5.719 | 6.155 | 4.154 | 3.81 | 4.71 | 5.243 | 5.789 | 5.512 | 3.699 | 3.553
6.012 | 4.411 | 7.664 | 5.157 | 5.456 | 6.02 | 6.042 | 6.245 | 5.444 | 7.926 | 5.023 | 6.488
2.916 | 2.496 | 4.283 | 3.316 | 3.897 | 2.374 | 2.752 | 4.555 | 3.422 | 3.886 | 2.858 | 2.244
7.607 | 10.635 | 13.662 | 8.084 | 9.09 | 7.503 | 6.872 | 14.282 | 8.089 | 8.321 | 6.951 | 8.124
Intersection | 10.466 | 25.134 | 7.982 | 16.425 | 17.532 | 10.693 | 12.44 | 17.613 | 16.374 | 12.417 | 17.272 | 9.574
22.891 | 27.88 | 18.106 | 25.118 | 24.338 | 20.236 | 22.678 | 23.819 | 25.063 | 22.277 | 25.459 | 18.176
5.884 | 11.229 | 7.443 | 5.547 | 6.822 | 7.714 | 4.749 | 11.612 | 5.287 | 5.726 | 5.873 | 5.899
39.873 | 51.509 | 46.548 | 46.511 | 52.673 | 28.055 | 35.888 | 66.884 | 45.261 | 30.682 | 47.626 | 31.167
- | ACC | 96.505 | 93.931 | 93.612 | 96.221 | 95.676 | 96.51 | 96.567 | 90.882 | 96.009 | 95.025 | 96.332 | 96.488
AUC | 98.97 | 98.828 | 98.536 | 99.075 | 99.102 | 99.236 | 99.026 | 96.461 | 99.189 | 99.003 | 99.354 | 99.401
AP | 98.987 | 99.195 | 98.953 | 99.17 | 99.234 | 99.415 | 99.012 | 97.279 | 99.351 | 99.285 | 99.503 | 99.461
EER | 3.829 | 6.004 | 6.668 | 4.322 | 4.314 | 3.248 | 3.592 | 9.875 | 4.351 | 5.072 | 3.882 | 3.583
Model Type | Naive | Frequency | Spatial | Fairness-enhanced
Dataset Size | Attribute | Metric
80% | Gender | 1.753 | 0.256 | 1.697 | 0.976 | 1.199 | 0.166 | 1.235 | 0.447 | 1.428 | 0.398 | 0.526 | 0.339
1.925 | 2.648 | 1.891 | 2.861 | 3.002 | 2.642 | 3.511 | 2.461 | 1.881 | 2.762 | 2.695 | 2.643
0.943 | 0.002 | 0.988 | 0.280 | 0.474 | 0.316 | 0.737 | 0.596 | 0.665 | 0.214 | 0.408 | 0.172
1.893 | 0.364 | 1.910 | 1.214 | 1.489 | 0.218 | 1.237 | 0.467 | 1.495 | 0.522 | 0.788 | 0.384
Race | 11.908 | 11.806 | 9.589 | 4.724 | 3.751 | 8.864 | 2.988 | 14.911 | 13.396 | 3.892 | 5.036 | 2.891
22.332 | 21.476 | 19.620 | 18.520 | 17.354 | 18.431 | 16.783 | 23.411 | 22.809 | 16.666 | 18.631 | 17.598
4.298 | 6.127 | 4.724 | 3.890 | 3.573 | 4.030 | 2.667 | 6.322 | 4.970 | 3.282 | 4.350 | 2.731
18.793 | 19.118 | 20.637 | 10.997 | 9.458 | 14.889 | 10.966 | 29.610 | 22.226 | 9.621 | 13.148 | 8.090
Age | 5.554 | 4.823 | 4.219 | 2.168 | 3.355 | 2.588 | 1.699 | 5.731 | 6.528 | 2.822 | 1.498 | 1.076
5.307 | 5.675 | 5.586 | 6.111 | 6.492 | 6.159 | 6.884 | 6.433 | 4.840 | 6.781 | 5.943 | 5.842
3.001 | 2.397 | 3.365 | 1.221 | 1.905 | 1.389 | 0.832 | 4.916 | 3.150 | 2.718 | 1.133 | 0.744
7.274 | 8.476 | 8.710 | 4.026 | 8.252 | 5.533 | 5.139 | 15.746 | 8.380 | 7.114 | 4.174 | 2.835
Intersection | 14.979 | 17.336 | 11.294 | 6.650 | 6.863 | 9.372 | 5.369 | 18.159 | 16.769 | 5.729 | 8.210 | 5.443
24.220 | 21.943 | 21.145 | 21.258 | 20.254 | 20.920 | 19.077 | 24.033 | 24.954 | 18.015 | 20.556 | 19.798
5.025 | 10.608 | 7.908 | 6.697 | 5.709 | 6.343 | 5.118 | 11.541 | 5.558 | 5.760 | 7.955 | 4.583
40.744 | 44.028 | 45.684 | 27.249 | 22.360 | 32.012 | 24.504 | 66.750 | 46.492 | 21.906 | 29.401 | 17.687
- | ACC | 96.629 | 94.917 | 94.904 | 95.309 | 96.461 | 96.548 | 97.736 | 90.898 | 95.586 | 95.808 | 97.317 | 98.277
AUC | 99.361 | 98.788 | 99.143 | 99.409 | 99.597 | 99.682 | 99.753 | 96.501 | 98.440 | 99.419 | 99.739 | 99.860
AP | 99.429 | 99.051 | 99.403 | 99.523 | 99.653 | 99.765 | 99.801 | 97.308 | 98.562 | 99.589 | 99.817 | 99.874
EER | 3.538 | 5.198 | 5.189 | 3.894 | 3.138 | 2.707 | 2.276 | 9.817 | 5.470 | 4.259 | 2.745 | 1.738
B.9 Fairness and Utility Trade-off
Fig. B.7 presents the trade-off between the fairness metric on the age attribute and AUC for three fairness-enhanced methods, showing how well each method balances utility against fairness. 1) PG-FDD [21] achieves the best utility-fairness trade-off overall. It improves fairness without sacrificing detection utility. For instance, PG-FDD achieves a higher AUC than DAW-FDD and DAG-FDD while maintaining comparable fairness metrics. 2) DAW-FDD [20] is sensitive to the hyperparameter that balances utility and fairness. For example, when its fairness metric approaches zero, its utility drops to chance-level performance. This sensitivity can hinder practical deployment, because extensive tuning is required to reach good performance. 3) To ensure broader applicability and reliability, future fairness approaches should aim to minimize sensitivity to hyperparameter settings.
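To make the two axes of such a trade-off plot concrete, the following minimal Python sketch computes one utility value (AUC) and one fairness value for a set of detector scores. It is illustrative only: the fairness measure used here (the largest gap in false positive rate between age groups), the 0.5 threshold, the group names, and the toy scores are all assumptions made for the example, not the benchmark's metric or data.

```python
from collections import defaultdict

def auc(scores, labels):
    """Rank-based AUC: probability that a fake (label 1) outscores a real (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fpr_gap(scores, labels, groups, threshold=0.5):
    """Assumed fairness measure: largest difference in false positive rate
    between any two demographic groups (0 means equal rates)."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for s, y, g in zip(scores, labels, groups):
        if y == 0:  # only real images can produce false positives
            negatives[g] += 1
            false_pos[g] += s >= threshold
    rates = [false_pos[g] / negatives[g] for g in negatives]
    return max(rates) - min(rates)

# Toy inputs (made up for illustration, not taken from the benchmark).
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7, 0.1, 0.4]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["young", "young", "young", "young",
          "senior", "senior", "senior", "senior"]

print("AUC:", auc(scores, labels))
print("FPR gap across age groups:", fpr_gap(scores, labels, groups))
```

Evaluating both functions once per hyperparameter setting yields one point per setting. A method whose points swing widely as the hyperparameter changes would show the kind of sensitivity described above.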
Appendix C Datasheet for AI-Face
In this section, we present a DataSheet [87] for AI-Face.
C.1 Motivation For Dataset Creation
• Why was the dataset created? For researchers to evaluate the fairness of AI face detection models or to train fairer models. Please see Section 2 ‘Background and Motivation’ in the submitted manuscript.
• Has the dataset been used already? Yes. Our fairness benchmark is based on this dataset.
• What (other) tasks could the dataset be used for? It could be used as training data for the generative-method attribution task.
• Who funded dataset creation? This work is supported by the U.S. National Science Foundation (NSF) under grant IIS-2348419 and the National Artificial Intelligence Research Resource (NAIRR) Pilot and TACC Lonestar6. Please see the Acknowledgment section.
C.2 Data Composition
• What are the instances? The instances that we consider in this work are real face images and AI-generated face images from public datasets.
• How many instances are there? We include more than 2 million face images from public datasets. Please see Table 13 for details.
• What data does each instance consist of? Each instance consists of an image.
• Is there a label or target associated with each instance? Each image is associated with a gender annotation, an age annotation, a race annotation, an uncertainty score for each of these three predictions, and a target label (fake or real).
• Is any information missing from individual instances? No.
• Are relationships between individual instances made explicit? Not applicable. We do not study the relationships between images.
• Does the dataset contain all possible instances or is it a sample? It contains all instances that our curation pipeline collected. Since the current dataset does not cover all images available online, more instances can very likely be collected in the future.
• Are there recommended data splits (e.g., training, development/validation, testing)? For detector development and training, the dataset can be split 6:2:2.
• Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. Yes. Despite our extensive efforts to reduce demographic label noise, including human corrections based on uncertainty scores, there may still be mislabeled instances. Given the dataset’s size of over 2 million images, it is impractical for humans to check and correct each image individually.
• Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? The dataset is self-contained.
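As an illustration of the recommended 6:2:2 split, the sketch below partitions a list of image identifiers into three subsets. It assumes the ratio maps to training, validation, and test sets in that order, and that a seeded random shuffle is acceptable. The function name `split_622` and the seed are hypothetical choices for this example, not part of the released benchmark code.

```python
import random

def split_622(items, seed=0):
    """Shuffle deterministically, then cut into 60% / 20% / 20% parts.
    Any remainder from integer division goes to the last (test) part."""
    items = list(items)
    random.Random(seed).shuffle(items)  # local RNG, so global state is untouched
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]
    return train, val, test_set

# Example with ten placeholder image identifiers.
train, val, test_set = split_622(range(10), seed=0)
print(len(train), len(val), len(test_set))
```

Fixing the seed keeps the partition reproducible across runs, which matters when fairness metrics are compared between detectors trained on the same split.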
C.3 Collection Process
• What mechanisms or procedures were used to collect the data? We build our AI-Face dataset by collecting and integrating public AI-generated face images sourced from academic publications, GitHub repositories, and commercial tools. Please see ‘Data Collection’ in Section 3.2.
• How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data? The data can be acquired after we verify a user-submitted, signed EULA.
• If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Not applicable. We did not sample data from a larger set. However, we use RetinaFace [60] to detect and crop faces so that each image contains only one face.
• Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created. The data was collected from February 2024 to April 2024, although the source data were originally released earlier. Please refer to the papers cited in Table 13 for the original release dates.
C.4 Data Processing
• Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? Yes. This is discussed in ‘Demographically Annotation Generation’ in Section 3.2.
• Was the ‘raw’ data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the ‘raw’ data. The ‘raw’ data can be acquired from the original data publishers. Please see the papers cited in Table 13.
• Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point. Yes. We use RetinaFace [60] to detect and crop faces so that each image contains only one face. Demographic annotations are produced by our annotator; see ‘Annotator Development’ in Section 3.1. Our annotator code is available on our GitHub repository.
• Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet? If not, what are the limitations? Yes. The dataset supports our goal, as it covers a comprehensive range of generation methods and provides demographic annotations for evaluating current detectors and training fairer detectors.
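The datasheet states that each image carries gender, age, and race annotations with an uncertainty score for each, and that human corrections were guided by those scores. The sketch below shows one way such records could be represented and screened for manual review. The field names, the label encoding, the 0.5 threshold, and the "any score above threshold" rule are hypothetical and are not taken from the released CSV files or annotator code.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    # All field names are hypothetical; the released CSV schema may differ.
    image_path: str
    gender: str
    age: str
    race: str
    gender_uncertainty: float
    age_uncertainty: float
    race_uncertainty: float
    label: int  # assumed encoding: 1 = fake, 0 = real

def needs_review(a, threshold=0.5):
    """Flag a record when any demographic prediction is too uncertain.
    The threshold is an illustrative assumption."""
    worst = max(a.gender_uncertainty, a.age_uncertainty, a.race_uncertainty)
    return worst > threshold

records = [
    Annotation("img_0001.png", "female", "young", "asian", 0.05, 0.10, 0.08, 1),
    Annotation("img_0002.png", "male", "senior", "white", 0.02, 0.71, 0.12, 0),
]

# Only records with at least one uncertain attribute go to a human annotator.
review_queue = [r.image_path for r in records if needs_review(r)]
print(review_queue)
```

Screening by uncertainty concentrates human effort on the small fraction of images most likely to be mislabeled, which is the practical motivation when checking every image by hand is infeasible.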
C.5 Dataset Distribution
• How will the dataset be distributed? We distribute all the data, together with CSV files containing all image annotations, under the CC BY-NC-ND 4.0 license and strictly for research purposes.
• When will the dataset be released/first distributed? What license (if any) is it distributed under? The data has been released under the CC BY-NC-ND 4.0 license for research use only. Users can access our dataset by submitting an EULA. The dataset license and EULA are available on our GitHub https://github.com/Purdue-M2/AI-Face-FairnessBench.
• Are there any copyrights on the data? We believe our use is ‘fair use’ since all data in our dataset is collected from public datasets.
• Are there any fees or access restrictions? No.
C.6 Dataset Maintenance
• Who is supporting/hosting/maintaining the dataset? The first author of this paper.
• Will the dataset be updated? If so, how often and by whom? We do not plan to update it at this time.
• Is there a repository to link to any/all papers/systems that use this dataset? Not right now, but we encourage anyone who uses the dataset to cite our paper so it can be easily found. Our fairness benchmark uses this dataset, and the benchmark code is on our GitHub https://github.com/Purdue-M2/AI-Face-FairnessBench.
• If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? Not at this time.
C.7 Legal and Ethical Considerations
• Were any ethical review processes conducted (e.g., by an institutional review board)? No official review was conducted, since all data in our dataset were collected from existing public datasets.
• Does the dataset contain data that might be considered confidential? No. We only use data from public datasets.
• Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. No. It is a face image dataset, and we have not seen any instance of offensive or abusive content.
• Does the dataset relate to people? Yes. It is a face image dataset containing real face images and AI-generated face images.
• Does the dataset identify any subpopulations (e.g., by age, gender)? Yes, through demographic annotations.
• Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? Yes. It is a face image dataset. Age, gender, and race can be identified from the face images and from the demographic annotations we provide. All of the images that we use are from publicly available data.
C.8 Author Statement and Confirmation of Data License
The authors of this work declare that the dataset described and provided has been collected, processed, and made available with full adherence to all applicable ethical guidelines and regulations. We accept full responsibility for any violations of rights or ethical guidelines that may arise from the use of this dataset. We also confirm that the dataset is released under the CC BY-NC-ND 4.0 license, permitting sharing and downloading of the work in any medium, provided the original author is credited, and it is used non-commercially with no derivative works created.