Article

Exploration of Metrics and Datasets to Assess the Fidelity of Images Generated by Generative Adversarial Networks

by Claudio Navar Valdebenito Maturana, Ana Lucila Sandoval Orozco and Luis Javier García Villalba *,†
Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), Faculty of Computer Science and Engineering, Office 431, Universidad Complutense de Madrid (UCM), Calle Profesor José García Santesmases 9, Ciudad Universitaria, 28040 Madrid, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(19), 10637; https://doi.org/10.3390/app131910637
Submission received: 13 July 2023 / Revised: 14 September 2023 / Accepted: 19 September 2023 / Published: 24 September 2023

Abstract
Advancements in technology have improved human well-being but also enabled new avenues for criminal activities, including digital exploits like deep fakes, online fraud, and cyberbullying. Detecting and preventing such activities, especially for law enforcement agencies needing photo profiles for covert operations, is imperative. Yet, conventional methods relying on authentic images are hindered by data protection laws. To address this, alternatives like generative adversarial networks, stable diffusion, and pixel recurrent neural networks can generate synthetic images. However, evaluating synthetic image quality is complex due to the varied techniques. Metrics are crucial, offering objective measures to compare techniques and identify areas for enhancement. This article underscores metrics’ significance in evaluating synthetic images produced by generative adversarial networks. By analyzing metrics and datasets used, researchers can comprehend the strengths, weaknesses, and areas for further research on generative adversarial networks. The article ultimately enhances image generation precision and control by detailing dataset preprocessing and quality metrics for synthetic images.

1. Introduction

Generative adversarial networks (GANs) represent one of the most impressive breakthroughs in machine learning in recent years. They work by pitting two neural networks against each other in a competition to generate realistic synthetic data [1]. There are currently several ways to generate synthetic images, such as:
  • Pixel Recurrent Neural Networks (Pixel RNNs): These models are based on recurrent neural networks that generate images iteratively, considering the spatial structure and the dependencies between pixels. They generate images pixel by pixel, in a specific order, using recurrent units to model dependencies [2].
  • Stable Diffusion: In the imaging context, stable diffusion is used to improve the quality and consistency of the generated images. It is based on the concept of iterating a generative model through a series of diffusion steps, gradually applying noise to the generated image so that it approaches a target distribution [3].
  • Generative Adversarial Networks (GANs): Consisting of a generator and a discriminator, GANs are a type of generative model. The generator transforms random noise inputs into synthetic images, while the role of the discriminator is to differentiate between these synthesized images and real ones. These components engage in adversarial training, where the generator aims to fool the discriminator with its outputs while the discriminator strives to sharpen its discrimination skills to classify real and fake images accurately [4].
Of the aforementioned models, this research focuses only on GANs because of their visual quality: they can generate high-quality, realistic images that are often difficult to distinguish from actual images. Additionally, GANs must strike a balance between the intensive use of computational resources and the speed with which they can generate synthetic images.
This versatile technology generates lifelike images, videos, and audio, applicable to diverse fields like entertainment, medicine, and drug discovery. Successful GANs hinge on the discriminator's ability to distinguish between real and generated data, which is shaped by the distance metric used in the loss functions. Instabilities in GAN training can be mitigated by altering the loss calculation's distance metric [5]. GANs have demonstrated their versatility in generating various types of data, including images, videos, and text, making them applicable in diverse domains such as computer vision, natural language processing, and generative modeling. With advancements in artificial intelligence technology in image processing, GANs have emerged as a powerful tool for generating high-quality synthetic images through deep learning techniques [6]. Because GANs belong to the family of unsupervised learning algorithms and have reached a remarkable level of realism, it has become increasingly challenging for human observers to visually distinguish real images from their synthetic counterparts.
GANs represent a nascent field of study with considerable untapped potential in unexplored realms of application. As the adoption of GANs proliferates, we can anticipate the emergence of increasingly inventive and innovative utilization scenarios. The authors in [7] explain that facial privacy protection aims to remove sensitive information from images to protect privacy. Early methods used masking, blurring, or pixelation, but these often made the faces undetectable and unsuitable for standard computer vision pipelines. Later studies used GANs to generate synthetic images as a solution for facial de-identification.
The motivation for this study lies in the need to establish a quality standard in the use of technologies related to the generation of synthetic images in various fields. These technologies often generate synthetic images with defects, which limits their usefulness and the full potential they could offer. This situation points to a gap between the theoretical capabilities of these technologies and their practical application.
The rest of the work is organized as follows: Section 2 describes what a GAN is, how it works, and its variants; it also highlights the relevant pros and cons of the main image generation models in the literature and presents their applications, the datasets used, and the metrics found in the state of the art. Section 3 details the GAN architectures used to generate synthetic images in the state of the art. Section 4 conducts experiments and analyzes the experimental results. The conclusions and future work are presented in Section 5.

2. What Is a Generative Adversarial Network?

A GAN consists of two networks: a generator and a discriminator. The generator aims to create data that convincingly resemble the real data, while the discriminator strives to accurately classify whether the input data are real or generated. This interplay between the generator and discriminator is crucial in training the GAN model to generate high-quality synthetic data. Through the training process, GANs establish a structured vector space within different domains, akin to other neural networks like variational autoencoders (VAEs) and language models. This vector space, known as latent space, is a lower-dimensional representation of the original data domain. GANs have demonstrated their efficacy across various applications, encompassing the generation of novel images, videos, and text. Moreover, they have found practical utility in tasks such as image synthesis for realism, image enhancement, and the upscaling of image resolution.

2.1. Exploring the Process

Depending on the nature of the problem to be solved, two types of approaches are identified in machine learning: discriminative models and generative models. Given some observable variables $X$ and some target variable $Y$ that depends on $X$, discriminative models learn to estimate $Y$ from $X$ (i.e., the conditional distribution of $Y$ given $X$). Generative models instead seek to estimate a probability distribution $P_r$. Generally, these models take a training set consisting of samples from the distribution $P_r$ and learn to represent an estimate of this distribution. The result is a probability distribution $P_g$ that can be expressed explicitly, operating directly with its density function $p_g$, or implicitly, by generating samples from it.
A GAN is an implicit generative model. Formally, it is a structured probabilistic model based on latent variables $z$ and observable variables $x$. The core of the GAN methodology is based on a two-player game: a generator, which produces samples that ideally come from the distribution of the training set, and a discriminator, which tries to differentiate between samples from the generator and the training set [8,9]. Both players are described in detail below:
  • The generator is a differentiable function $G: \mathcal{Z} \times \mathbb{R}^{g} \to \mathcal{X}$ that takes latent variables $z \in \mathcal{Z}$ drawn from a known prior distribution $P_z$ and some parameters $\theta \in \mathbb{R}^{g}$, and outputs a sample $x \in \mathcal{X}$. We denote by $G_{\theta}: \mathcal{Z} \to \mathcal{X}$ the generator with parameters $\theta$.
  • The discriminator is a differentiable function $D: \mathcal{X} \times \mathbb{R}^{d} \to \mathbb{R}$ that takes a sample $x \in \mathcal{X}$ and some parameters $\omega \in \mathbb{R}^{d}$ and computes a value that quantifies how real or synthetic the sample is. We denote by $D_{\omega}: \mathcal{X} \to \mathbb{R}$ the discriminator with parameters $\omega$.
In practice, both players are implemented with neural networks. The choice of the type of neural network to use for the generator and discriminator is free. As imaging is the field in which GANs are most widely used, the models with the best results use a convolutional neural network (CNN) architecture for the generator and discriminator [10,11].
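To make this concrete, the following is a minimal sketch in PyTorch of a DCGAN-style generator and discriminator for 64 × 64 RGB images, matching the notation above ($G_{\theta}$ maps latent vectors to images, $D_{\omega}$ maps images to a realness score). The layer widths and the latent dimension are illustrative choices and are not taken from any specific model in the reviewed literature.

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # dimension of the latent variables z (illustrative choice)

class Generator(nn.Module):
    """G_theta: Z -> X, maps latent vectors z to 64x64 RGB images."""
    def __init__(self, latent_dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(
            # Project z to a 4x4 feature map and progressively upsample to 64x64.
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),  # output pixels in [-1, 1]
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    """D_omega: X -> R, returns one realness logit per 64x64 image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 4, 1, 0),  # 4x4 feature map -> single logit
        )

    def forward(self, x):
        return self.net(x).view(-1)
```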
The training of a GAN is the process by which the parameters of the neural network are adjusted so that the generator produces images or synthetic data that are more and more similar to the real data, and the discriminator is able to distinguish with greater precision between the images generated and the real ones. The goal is for the generator to produce images that are increasingly difficult to distinguish from real images and for the discriminator to be increasingly accurate in its task of distinguishing between generated and real images.
For this, the generator and the discriminator have cost functions defined in terms of the parameters of both. The generator seeks to minimize its cost function $C_G(\theta, \omega)$ by controlling only its parameters $\theta$ and, similarly, the goal of the discriminator is to minimize its cost function $C_D(\theta, \omega)$ by modifying only its parameters $\omega$. The solution to this game is a Nash equilibrium, a tuple $(\theta^{*}, \omega^{*})$ that is a local minimum of $C_G$ with respect to $\theta$ and of $C_D$ with respect to $\omega$. In general, the cost functions are chosen to minimize a metric (distance, divergence, etc.) between the distribution of the training set $P_r$ and the distribution of the generator $P_g$.
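For concreteness, the original GAN formulation instantiates these cost functions as a zero-sum minimax game over the value function

$$\min_{\theta}\;\max_{\omega}\; V(G_{\theta}, D_{\omega}) = \mathbb{E}_{x \sim P_r}\big[\log D_{\omega}(x)\big] + \mathbb{E}_{z \sim P_z}\big[\log\big(1 - D_{\omega}(G_{\theta}(z))\big)\big],$$

so that $C_D = -V$ and, in the zero-sum case, $C_G = V$. This is only a reference formulation: as noted above, choosing a different distance or divergence between $P_r$ and $P_g$ leads to different GAN variants (e.g., the Wasserstein distance used by WGAN).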
Finally, the training of a GAN is achieved using stochastic gradient descent (or any of its variants). At each step, two sets or batches are sampled: a batch of real samples $x_r$ from the training set and a batch of latent variables $z$ taken from the known distribution $P_z$. With these batches, the parameters $\theta$ (to reduce $C_G$) and $\omega$ (to reduce $C_D$) are updated simultaneously or alternately.
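The following is a minimal sketch of one such alternating training step in PyTorch, assuming the Generator/Discriminator modules sketched above and standard binary cross-entropy losses; the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, real_batch, opt_g, opt_d, latent_dim=100):
    """One alternating update of the discriminator (omega) and generator (theta)."""
    device = real_batch.device
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, device=device)
    zeros = torch.zeros(batch_size, device=device)

    # 1) Update omega to reduce C_D: real samples labeled 1, generated samples labeled 0.
    z = torch.randn(batch_size, latent_dim, device=device)        # z ~ P_z
    fake_batch = G(z).detach()                                    # no gradient into G here
    loss_d = F.binary_cross_entropy_with_logits(D(real_batch), ones) + \
             F.binary_cross_entropy_with_logits(D(fake_batch), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Update theta to reduce C_G (non-saturating form: push D(G(z)) toward "real").
    z = torch.randn(batch_size, latent_dim, device=device)
    loss_g = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

In practice, the optimizers would typically be instances of a stochastic gradient method such as Adam, e.g., torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999)).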

2.2. Types and Applications of Generative Adversarial Networks

This subsection presents a wide variety of types and applications of GANs identified through an exhaustive review of the state of the art. The listing covers the various categories and fields of application that have been developed and applied in the scientific and technical literature.

2.2.1. Types of Generative Adversarial Networks

In this subsection, we list a variety of types of generative adversarial networks (GANs) that have emerged in the literature and research in recent years. It is important to note that the diversity of variants of GANs can be significant, and Figure 1 has been created based on the models found in our review of the state of the art. The main purpose of this figure is to provide an illustrative visual representation of the main categories of GANs that we have identified in our research.
At the top level of Figure 1, the main GANs that we have identified as benchmarks in the field are presented. These include models such as StyleGAN, Conditional GAN (cGAN), Vanilla GAN, Deep Convolutional GAN (DCGAN), CycleGAN, and Wasserstein GAN (WGAN) [4,12,13,14,15,16,17,18]. Each of these models has its own unique characteristics and applications in data generation.
Below, in the lower levels of Figure 1, the variants of the main GANs mentioned above are specifically detailed. These variants include models such as Auxiliary Classifier GAN (ACGAN), Least Squares GAN (LSGAN), Super-Resolution GAN (SRGAN), Self-Attention GAN (SAGAN), StyleGAN2, StyleGAN3, DualGAN, Wasserstein GAN with Gradient Penalty (WGAN-GP), and Text to Image Synthesis GAN [11,19,20,21,22,23,24,25,26]. Each of these variants features specific innovations and adaptations to meet the needs of various applications and data generation challenges.
The intention for presenting this classification is to provide readers with an overview of the diversity and richness in the development of GANs, as well as to highlight the continuing evolution of this research area. As the field of GANs continues to grow, it is essential to be aware of these variations and their applications in order to take full advantage of this powerful data generation technology.
It should be mentioned that the search period for the GANs was between the years 2014 and 2023, with keywords corresponding to metrics and datasets. Figure 2 shows the frequency of the important architectures in the literature review. Although the formal definition of generative adversarial models was established in 2014, it is important to note that the most significant developments in this field have taken place since 2018 and continue today. This trend is clearly evident when looking at the growing amount of research on the topic. For example, when searching Scopus with the term “GAN”, a steady increase in the number of publications over the years can be observed, which supports the importance and timeliness of our work.

2.2.2. Applications of Generative Adversarial Networks for the Creation of Synthetic Images

In recent years, the popularity of GANs has surged due to their remarkable image synthesis capabilities. GANs are commonly employed in image augmentation and image synthesis. Image augmentation involves creating new training examples through transformations or variations applied to existing images. In parallel, GANs are adept at generating synthetic data that closely resemble real images, thereby expanding training datasets, enhancing generalization, and improving the performance of machine learning models. Furthermore, GANs exhibit notable proficiency in image synthesis, enabling the creation of entirely new images based on specific constraints or specifications. Their prowess in both image augmentation and synthesis has unlocked diverse opportunities across fields such as computer vision, graphics, and data augmentation for machine learning tasks. By harnessing the potential of GANs, researchers and practitioners can elevate both the quality and quantity of available data, ultimately enhancing performance in various image-related applications.
The successes of GANs extend across diverse domains. They have been effectively applied in domains such as medicine, pandemics, image processing, face detection, texture transfer, traffic control, and sequential data generation (financial or musical data). Notably, successful GAN applications encompass data modeling (image generation), image super-resolution, image synthesis, image segmentation, text generation, text-to-image synthesis, medical image synthesis, image and audio signals, malware detection, climate data analysis, music generation, voice conversion, video generation, video super-resolution, etc. [27,28,29,30,31,32,33,34,35,36,37,38].
Figure 3 showcases the manifold application domains of GANs.
In essence, the potential applications of GANs are virtually limitless, bounded only by the creative capacity of developers who leverage their capabilities.
These domains represent the specific areas in which GANs have demonstrated their usefulness and versatility, showing how this innovative approach has been applied in a variety of contexts and sectors.
GANs have also been used in a variety of other domains such as chess game playing, mobile user profiling, data augmentation, autonomous driving, fires, and earthquakes.

2.3. Main Data Repositories Utilized

The datasets commonly used with GANs include:
  • Flickr Faces HQ (FFHQ): This collection of high-quality images featuring human faces is sourced from the photo-sharing platform Flickr. This dataset encompasses a substantial compilation of 70,000 human face images, each with a resolution of 1024 × 1024 pixels. FFHQ exhibits significant diversity in age, ethnicity, and image backgrounds, providing a rich and varied set of training data for GANs [39]. Researchers and practitioners often leverage the FFHQ dataset to train GAN models in face image generation and manipulation. Its extensive facial characteristics and high-resolution nature make it an ideal resource for training models that aim to generate realistic and diverse human face images. By utilizing datasets like FFHQ, GAN models can learn from a vast collection of human face images, capturing the nuances and complexities inherent in facial appearances. These datasets play a crucial role in advancing the capabilities of GANs and enabling their applications in various domains, including computer vision, graphics, and facial recognition systems [11]. In Ref. [40], the authors present a novel framework named AgeTransGAN for addressing facial age transformation across significant age differences, including progression and regression. The AgeTransGAN framework comprises two primary components: a generator and a conditional multitask discriminator with an embedded age classifier. The generator utilizes an encoder–decoder architecture to synthesize images, while the conditional multitask discriminator assesses the authenticity and age information of the generated images. The primary goal of AgeTransGAN is to disentangle the identity and age attributes during the training process. This disentanglement is achieved by incorporating various techniques, including cycle-generation consistency, age classification, and cross-age identity consistency. By effectively separating the identity and age characteristics, AgeTransGAN aims to enable more accurate and controllable age transformation across significant age gaps.
  • ImageNet: An extensively annotated, large-scale dataset utilized for visual object recognition research. It comprises an impressive collection of over 14 million images, encompassing a wide range of more than 20,000 distinct object categories [41,42].
  • CelebA: A unique dataset in face attribute analysis that encompasses a substantial collection of 200,000 celebrity images. Each image in the dataset is meticulously annotated with 40 attribute labels, making it a valuable resource for tasks related to face image generation and manipulation [43]. CelebA-HQ and CelebAMask-HQ, developed as extensions of CelebA, are also extensively employed in similar applications.
  • LSUN: An acronym for large-scale scene understanding, LSUN consists of an impressive assortment of approximately one million labeled images. This dataset is organized into ten distinct scene categories, including memorable scenes like bedrooms, churches, and towers. Moreover, LSUN encompasses twenty object classes, such as birds, cats, and buses. The images explicitly belonging to the church and bedroom scenes, as well as those representing cars and birds, are frequently employed in GAN inversion methods. In addition to the datasets mentioned above, several other resources are employed in GAN inversion studies. These include DeepFashion, Anime Faces, and StreetScapes. These datasets play a pivotal role in conducting comprehensive experiments to assess the efficacy of GAN inversion techniques [44].
  • Flickr Diverse Faces (FDF): This comprises a vast collection of 1.47 million human faces with a resolution of 128 × 128 or higher. Each face in the dataset is accompanied by a bounding box and keypoint annotations, offering precise localization information. The dataset is designed to exhibit diversity in various aspects, including facial pose, occlusion, ethnicity, age, background, and scene type. The scenes captured in the dataset encompass a range of contexts, such as traffic, sports events, and outdoor activities. Regarding facial poses, the FDF dataset offers more diversity than the FFHQ and Celeb-A datasets, although its resolution is lower than that of the FFHQ dataset. The primary objective of the FDF dataset is to facilitate the advancement of face recognition algorithms by providing a publicly available resource that includes source code and pre-trained networks. The face images in the FDF dataset were extracted from a pool of 1.08 million images in the YFCC100-M dataset, and the annotations were generated using state-of-the-art models for accurate and reliable results [45]. The FDF dataset, with its high-quality annotations and diverse range of faces, provides a valuable resource for advancing face recognition algorithms.
These studies indicate a need for a deeper understanding of the current state of datasets used in GANs. The findings presented in [46] suggest that while most GAN models can achieve similar performance scores with good hyperparameter optimization and random restarts, further improvements can be obtained by allocating a larger computational budget and conducting extensive tuning rather than relying solely on fundamental algorithmic changes. The study in [47] found that the existing measures for assessing GAN performance are not always adequate for the task at hand, and Ref. [48] found that there is no established measure to quantify the failure modes of GANs. Together, these papers suggest that the current state of datasets for GANs needs to be better understood and that further research is needed to develop better methods for assessing GANs' performance. The following papers point to challenges in creating GAN datasets. The authors in [46] concluded that determining the superiority of one GAN algorithm over another is challenging: most GAN models could acquire similar performance scores given sufficient hyperparameter optimization and random restarts, so the comparative evaluation of GAN algorithms requires careful consideration and thorough experimentation to discern meaningful differences in performance. This suggests that there is no clear consensus on which GAN algorithm is best and that improvements may come from more computational resources rather than fundamental algorithmic changes. Ref. [49] found that data generated by a GAN cannot statistically be better than the data they were trained on, which suggests that there are limitations to what GANs can achieve. Also, the research in [4] found that generating images of an unprecedented quality is possible using a progressive, growing method for training GANs.
Regarding the datasets used in face verification, the research in [50] provides insights into various options. For instance, Labeled Faces in the Wild (LFW) (https://github.com/securifai/masked_faces, accessed: 25 June 2023) is a commonly used dataset consisting of 13,233 face images from 5749 individuals, serving as a standard benchmark for evaluating face verification algorithms under unconstrained conditions. Another widely used dataset is CASIA-WebFace (https://github.com/securifai/masked_faces, accessed: 26 June 2023), which contains many identities (10,575) and facial data samplings (494,414), making it a popular choice as a baseline dataset for face verification tasks. CelebFaces+ has 202,599 images of 10,177 identities, and its annotated version, CelebA, has five landmark locations and 40 binary attributes annotated. MegaFace has 4.7 million samplings from 672,057 identities but has limited variations per identity. The Ms-Celeb-1M dataset is considered the largest publicly accessible face recognition dataset, comprising 10 million facial samplings belonging to 100,000 distinct identities. This dataset is a valuable resource for training and testing face recognition systems, providing a diverse range of facial images to facilitate a robust and comprehensive evaluation of face recognition algorithms. The VGGFace and VGGFace2 datasets were released by the Visual Geometry Group at the University of Oxford and are used for training and testing. YTF, UMDFaces-Videos, and UMDFaces are other face datasets. FFHQ and CelebA-HQ are high-quality datasets used for training models that generate high-quality images. Large corporations such as Facebook and Google have large in-house datasets.
Table 1 summarizes the features of the relevant datasets found in the state of the art.
Figure 4 shows the frequency of the important datasets extracted in the literature review.

2.4. Main Metrics Used

As a starting point for this literature review, we considered the work in [9], which introduced and applied the FID metric, highlighting its relevance in evaluating the results generated by GANs. The main focus was on understanding how the FID metric has been employed in various investigations. However, during the course of this review, it was observed that many studies simply use the FID metric without addressing essential technical aspects, such as data preparation or balance in the dataset. This observation led us to broaden our search beyond papers that only use the FID metric. We included research proposing new metrics aimed at assessing the quality of results generated by GANs, which allowed for a more complete and comprehensive view of evaluation in this field.
The correct use of metrics in GANs is of the utmost importance for accurately assessing the performance of these models. Choosing appropriate metrics for evaluation can make or break the results of a scientific paper, as it can affect the validity and reliability of the findings. Some of the most commonly used metrics with a brief description are as follows:
  • Inception Score (IS): This measures the quality and diversity of generated images by using a pre-trained Inception-v3 classifier network to calculate the average probability distribution of classes for the generated images. The score is based on two factors: the quality of the generated images (measured using the entropy of the class distribution) and the diversity of the generated images (measured using the Kullback–Leibler divergence between the class distribution of the generated images and that of the training set). A higher IS indicates that the generated images are high quality and diverse [56].
  • Fréchet Inception Distance (FID): This quantifies the similarity between the feature distributions of real and generated images in a high-dimensional feature space. It uses a pre-trained Inception-v3 network to extract features from both sets of images and calculates the Fréchet distance between their distributions. A lower value indicates that the generated images are more similar to the real images regarding their high-level features [47,57]; a formula and a computation sketch are given after this list.
  • Learned Perceptual Image Patch Similarity (LPIPS): This is a metric that quantifies the perceptual similarity between image pairs by comparing their feature representations in a deep neural network. It calculates the distance between the feature representations of the two images at multiple layers of a pre-trained network and aggregates these distances to derive a similarity score. A lower score indicates a higher perceptual similarity between the generated and real images [58].
  • Precision and Recall: These are standard measures used in machine learning to evaluate the performance of classifiers. They can also be used to evaluate the performance of GANs by comparing the generated images to real images and computing the precision and recall of the generated images for the real images [59]. An important comment concerning precision and recall is that they may not be the most appropriate metrics for evaluating the performance of GANs, as they only provide information about the ability of the models to identify positive samplings correctly and may not capture other aspects of the image quality, such as diversity, realism, and perceptual similarity.
  • Structural Similarity Index (SSIM): This measures the similarity between pairs of images based on their luminance, contrast, and structural similarities. It compares the pixel values of the two images at each point in a local window, taking into account the contrast and structural similarities between the regions of the window. A higher value indicates that the generated images are more comparable to the real images regarding their structural properties [60].
  • Mean Squared Error (MSE): This measures the average squared difference between the pixel values of the generated and real images. It is a simple and widely used metric but does not consider perceptual differences between the images [24].
  • Kernel Inception Distance (KID): This measures the similarity between the feature distributions of the real and generated images using a kernel-based method. It calculates the squared distance between the kernel mean embeddings of the two distributions, where the feature space of a pre-trained Inception-v3 network defines the kernel. A lower value indicates that the generated images are more similar to the real images regarding their high-level features [61].
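To make the FID computation concrete: with $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denoting the mean and covariance of the Inception-v3 features of the real and generated images, FID is the Fréchet distance between Gaussians fitted to those features,

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2} + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\big).$$

The following NumPy/SciPy sketch implements this formula; it assumes the 2048-dimensional Inception-v3 features have already been extracted elsewhere (e.g., with a pre-trained torchvision model), which is not shown here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake, eps=1e-6):
    """FID between two sets of Inception-v3 features of shape (N, 2048)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_fake, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if not np.isfinite(covmean).all():
        # Regularize nearly singular covariances, as common implementations do.
        offset = np.eye(sigma_r.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma_r + offset) @ (sigma_g + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```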
In [62], the problem of evaluating the quality of samplings generated by a machine learning model is addressed. Two well-known metrics are considered: precision and recall. Precision measures the proportion of generated samplings that are correct, i.e., that resemble the real samplings, while recall measures the proportion of real samplings that are correctly covered by the generator. However, the precision and recall metrics have certain limitations when assessing the fidelity and diversity of generated samplings. Fidelity refers to the similarity between generated and real samplings, while diversity measures the extent to which the generated samplings represent the full range of variability in the real samplings. Two novel metrics, density and coverage, are introduced to address these limitations. Density measures how densely the generated samplings populate the neighbourhoods of the real samplings (a fidelity-oriented measure), while coverage measures the fraction of real samplings whose neighbourhoods contain at least one generated sampling (a diversity-oriented measure). Furthermore, the study suggests that random embeddings outperform ImageNet pre-trained embeddings, mainly when the target distribution significantly differs from the ImageNet statistics. This highlights the importance of carefully selecting appropriate embeddings when evaluating the quality of generated samplings.
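As a minimal sketch of the density and coverage metrics just described, following the neighbourhood-based definitions in [62] and assuming both real and generated images have already been mapped to feature embeddings:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def density_and_coverage(real_feats, fake_feats, k=5):
    """Density and coverage over embeddings of shape (N, D) for real and (M, D) for fake."""
    # Radius of each real sample's k-NN ball (k + 1 neighbours because the
    # query point is its own nearest neighbour within the fitted set).
    nn_real = NearestNeighbors(n_neighbors=k + 1).fit(real_feats)
    radii = nn_real.kneighbors(real_feats)[0][:, -1]      # shape (N,)

    # Which generated samplings fall inside which real-sample neighbourhoods.
    dists = cdist(real_feats, fake_feats)                 # shape (N, M)
    inside = dists <= radii[:, None]

    density = inside.sum() / (k * fake_feats.shape[0])    # fidelity-oriented
    coverage = inside.any(axis=1).mean()                  # diversity-oriented
    return float(density), float(coverage)
```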
In the study conducted in [63], the focus is on evaluating the quality of generated Earth observation images using state-of-the-art GAN models. The authors compare the results obtained from these models with high-quality unconditional image synthesis achieved in other domains. To assess the quality of the generated images, they employ two widely used metrics, namely FID and KID. These metrics measure the similarity between the distributions of fake and real images by analyzing the feature maps of an Inception-v3 classifier. However, the paper concludes that while FID and KID are commonly used metrics, they may not accurately represent the visual quality of images. The authors find that models performing well in terms of FID and KID on smaller datasets may be more suitable for tasks where training data need to be balanced. This implies that the metrics alone may not provide a comprehensive assessment of visual quality and should be considered alongside other factors when evaluating the performance of GAN models for Earth observation image generation.
The evaluation measures employed include FID, W-distance, and MMD, which calculate the distance between two distributions [64]. On the other hand, IS and its variants, such as m-IS, mode score, and AM score, utilize the conditional and marginal distributions of generated or real data to assess the diversity and fidelity of samplings. Average log-likelihood and range metrics are utilized to evaluate probability distributions. Reconstruction error and specific quality measures assess the dissimilarity between generated images and their related or nearest counterparts in the training set. An ideal evaluation metric should possess well-defined bounds and demonstrate sensitivity to image distortions and transformations [57].
GANs undergo evaluation using various metrics to gauge the quality of the generated images. Each metric has advantages and limitations, and the selection should align with the specific evaluation objectives. Typically, the IS and FID metrics are widely employed to assess the quality and diversity of the generated images. The IS metric quantifies the quality and diversity by calculating the KL-divergence between the class probabilities of the generated and real images. On the other hand, the FID metric measures the distance between the feature distributions of the generated and real images within a high-dimensional feature space.
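Written out, the IS described above is

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim P_g}\, D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, p(y)\big)\Big),$$

where $p(y \mid x)$ is the class distribution assigned by a pre-trained Inception-v3 classifier to a generated image $x$ and $p(y)$ is the marginal class distribution over the generated set; sharp conditional distributions (quality) and a broad marginal distribution (diversity) both increase the score.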
In contrast, the LPIPS, SSIM, and KID metrics focus more on measuring perceptual and high-level feature similarities between the real and generated images. LPIPS estimates the perceptual similarity between pairs of images by comparing their feature representations in a deep neural network. SSIM computes the structural similarity between the real and generated images based on luminance, contrast, and structural differences. KID uses kernel methods to measure the discrepancy between the empirical distributions of the features extracted from the real and generated images. Finally, MSE is a simple and fast metric that measures the pixel-wise difference between the generated and real images. However, it is not well suited to evaluating the visual quality of GAN-generated images, as it does not consider perceptual similarities or high-level features. Therefore, the choice of metric should be based on the specific conditions of the evaluation task and the strengths and weaknesses of each metric. These are just the standard metrics that can be used to estimate the performance of GANs. The choice of metric will depend on the specific problem we are trying to solve and the desired properties of the generated images. Simply comparing the original dataset with the new dataset of synthetic images is not enough, as it is possible to generate a significant number of flawed images that do not reflect the real data.
This is where careful analysis and the comparison of real and synthetic images become crucial. GANs have their advantages and disadvantages, and a good understanding of the available data leads to better results. This is the case, for instance, in the generation of synthetic faces, where not all generated faces are perfect: many have defects that fail even a cursory visual inspection [56]. By using a combination of metrics well suited to the task at hand, researchers can gain a deeper understanding of the strengths and weaknesses of their GAN models. This, in turn, can lead to improvements in the development and evaluation of GAN models, which is crucial for advancing the field of generative modeling [65]. Moreover, selecting appropriate metrics can also impact the applicability of GAN models in real-world scenarios, such as in computer vision applications, where high-quality and diverse image generation is a key requirement [66].
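As a complement to the distribution-level metrics, the per-image metrics mentioned above (MSE and SSIM) can be computed directly on image pairs; a minimal sketch with scikit-image, assuming 8-bit RGB arrays of identical size, is shown below.

```python
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity

def pixel_metrics(real_img, fake_img):
    """MSE and SSIM between two uint8 RGB images of shape (H, W, 3)."""
    mse = mean_squared_error(real_img, fake_img)
    ssim = structural_similarity(real_img, fake_img, channel_axis=-1, data_range=255)
    return float(mse), float(ssim)
```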
Figure 5 is primarily intended to illustrate that, although generative adversarial networks (GANs) are often associated with high-definition image generation, our comprehensive literature review reveals a wide variety of metrics used to evaluate GAN-generated images. These metrics have been variously applied in different contexts, demonstrating their versatility beyond GANs. It is important to note that all the metrics mentioned in our article are derived from a thorough analysis of the state of the art in this field. Among the most commonly used metrics are FID, PSNR, IS, and SSIM, with FID being particularly relevant due to its specific definition for generative models such as GANs.
Table 2 summarizes the information found in the literature reviewed in this research regarding the architectures, datasets, and metrics used. As several articles in the literature also note, there is no single absolute metric; instead, there are clear trends toward the most used metrics, such as FID or IS, which are in some cases accompanied by another metric that varies according to the research. There is also a clear tendency to use the CIFAR-10 dataset, although the FFHQ dataset is being used increasingly often, at least according to the trend found in this research. The characteristics of CIFAR-10 make it very practical, because GANs by default require high-quality images and considerable computational capacity, which is why this versatile dataset is widely used [67]. It should be noted that the top three entries in each category were arbitrarily selected, and the intersection of this information was then examined.
Understanding the risks posed by emerging technologies is a prolonged process. Regulations offer guidance, but assessing their effectiveness and ethical alignment comes with implementation. AI raises ethical concerns in terms of privacy and data protection. Dealing with substantial personal data can jeopardize user privacy and security. Prioritizing robust security measures and user data control is essential. The societal, economic, and political impact of AI, including surveillance and social control, is substantial. The big data paradigm enables vast, complex data utilization, but lacking ethical principles can breach individual rights. Artificial intelligence, machine learning, and big data drive innovation and digital transformation. Yet, mismanaging data invites identity, privacy, and reputational threats. Regulations like the GDPR address these issues, covering data ownership, minimization, and accuracy. Ethical aspects in AI, machine learning, and big data mirror concerns about identity, privacy, and reputation [75,76].

2.5. Advantages and Constraints of Generative Adversarial Networks

GANs have shown great potential in various fields, but they are not perfect and require careful consideration and tuning to overcome their constraints.

2.5.1. Advantages

The primary advantage of GANs lies in their ability to generate synthetic data that closely resemble real data. This is achieved through a competitive learning process between the generator and discriminator networks, where the generator improves its capacity to create increasingly realistic data. This property of GANs makes them highly valuable for tasks such as data augmentation and generating training data for machine learning models [1]. While GANs are commonly used for image generation tasks, they can also be applied to other data types, including text or audio. By manipulating the latent code, which serves as the input to the generator network, specific attributes of the generated image can be modified while preserving the other attributes [77]. However, it is essential to note that attribute manipulation in the latent space is limited to the generated images produced by the GAN generator and does not extend to real images [4]. GANs lack the inference capability for real data, meaning that the attribute manipulation is specific to the synthetic data generated by the network. In [78], the authors apply DCGAN-based data augmentation to thorax radiographs to increase the diversity of the dataset. The research covers several issues, such as privacy concerns related to patient data, the lack of access to large volumes of information, and the diversity of information. Their results show that data augmentation using generative models achieves better accuracy than a traditional CNN model without it. There are several advantages of using GANs for data generation:
  • GANs can generate visually realistic, high-quality images that resemble real images to human observers.
  • GANs can generate diverse data samplings, which is beneficial for training machine learning models.
  • GANs exhibit relative ease of training and often achieve faster convergence than other generative models.
  • GANs possess the potential to acquire knowledge from data lacking important label information, rendering them valuable for unsupervised learning endeavors.
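Returning to the latent-code manipulation mentioned at the start of this subsection, the sketch below illustrates the general idea with a hypothetical trained generator G and a hypothetical attribute_direction vector; such directions must be found separately (for instance, by fitting a linear model in the latent space), and the sketch is illustrative only rather than a method taken from the reviewed papers.

```python
import torch

def edit_attribute(G, z, attribute_direction, strength=1.5):
    """Shift a latent code along an attribute direction and decode both versions.

    G: trained generator (hypothetical); z: latent codes of shape (B, latent_dim);
    attribute_direction: vector in latent space associated with one attribute.
    """
    z_edited = z + strength * attribute_direction  # move along the attribute axis
    with torch.no_grad():
        original = G(z)
        edited = G(z_edited)
    return original, edited
```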

2.5.2. Constraints

Conversely, GANs exhibit limitations in terms of replication rate, which decays exponentially with dataset size and complexity, resulting in image quality shortcomings. The authors in [57] posit the hypothesis that an increase in dataset size leads to heightened dataset complexity, thereby causing a decline in the replication rate. Additionally, they show that the quality of generated samplings initially deteriorates but later improves as the number of training samplings increases. The practical aim of the study in [79] is to enhance our understanding of the relationship between dataset size and replication in the context of GANs during image synthesis. To achieve this, the authors developed a tool capable of predicting the required number of training iterations before the onset of exponential decay in the replication rate. This exploration sheds light on the underlying mechanisms behind GAN replication and over-fitting, contributing to a more comprehensive understanding of these phenomena.
GANs are potent models for generating synthetic data, but they are notorious for their unstable training process, which is plagued by several challenges. This section will review some of the critical challenges that arise during GAN training. The following list shows the common obstacles:
  • Vanishing gradient: This problem can lead to the gradients of the generator, with respect to the weights in the early layers of the network, becoming so small that those layers stop learning, resulting in poor-quality image generation [80]. In addition, a well-trained discriminator network may confidently reject the generator-generated samplings due to this problem. The challenge of optimizing the generator is compounded by the fact that the discriminator does not share any information, which can damage the overall learning capacity of the model [81]. A common mitigation is sketched after this list.
  • Mode collapse: Mode collapse poses a critical challenge in GAN training, leading to the generator consistently producing identical outputs. This failure in GAN training is attributed to the generator exhibiting low diversity in its generated data or producing only a limited range of specific real samplings. Consequently, the utility of the learned GANs becomes restricted in numerous computer vision applications and computer graphics tasks [82].
  • Shortcoming of accurate evaluation metrics: The lack of accurate evaluation metrics is a critical challenge in GAN research. Evaluating the quality of images generated by GANs remains an active area of research, especially in the presence of unstable training. Despite the demonstrated successes of GANs in a variety of applications, determining which method is superior to another in terms of evaluation remains a complex problem. Currently, there is no universally accepted standard for evaluation, and each paper introducing a new GAN proposes its own evaluation techniques, leaving no agreed-upon parameters and making a fair comparison of models difficult [57,83,84]. One of the main challenges for current GAN evaluation metrics is to assess both diversity and visual fidelity simultaneously: diversity implies that all modes are covered, while visual fidelity implies that the generated samples must have a high probability. Some measures are also impractical to calculate for large sample sizes, and there is currently no powerful universal measure for assessing GANs, which may hinder progress in this field [83,85]. While various quantitative and qualitative metrics have been proposed to evaluate generative models, their complexity and limitations make the choice of a single metric a complicated process, and the precise evaluation of GANs remains an active and evolving research topic. The criteria provided in this paper therefore offer a framework for evaluating and selecting appropriate metrics, taking into account quality, stability, global capturability, robustness, efficiency, interpretability, and relevance to the specific task [57]. Current metrics mainly focus on comparing feature distributions between real and GAN-generated data; visual quality is expected to be an indirect result of the evaluation and is not optimized or tested directly. Despite the effectiveness of existing metrics in quantitatively assessing the statistical similarity between real and generated data, the assessment of visual quality remains an open challenge, and no metric has yet been developed that can directly and effectively assess the perceptual quality of the generated images [86]. Finally, an effective evaluation metric for generative models such as GANs must distinguish between models that generate high-quality samples and those that do not, be robust to small perturbations in the input data, capture the overall structure of the generated samples, be sensitive to mode dropping in order to detect deficiencies in the generation of specific patterns, and be computationally efficient for practical application. Interpretability is essential for understanding the results, and the choice of metric must be aligned with the specific task it seeks to address, ensuring its relevance in the context of the application [83,84].
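A common mitigation of the vanishing-gradient problem listed above, already suggested in the original GAN formulation, is to replace the saturating generator cost $C_G = \mathbb{E}_{z \sim P_z}[\log(1 - D_{\omega}(G_{\theta}(z)))]$, whose gradient collapses when a confident discriminator rejects the generated samplings, with the non-saturating alternative

$$C_G = -\,\mathbb{E}_{z \sim P_z}\big[\log D_{\omega}(G_{\theta}(z))\big],$$

which keeps the generator gradients usable early in training; this is the variant used in the training-step sketch in Section 2.1.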
From a practical standpoint, it is crucial to mention that GANs face several challenges, including instability between the generator and discriminator; the strictness of the discriminator must be balanced so that it is not too strict. Another challenge is determining the correct positioning of objects in an image, which can result in the generator creating unrealistic images. GANs also struggle to understand the global or holistic structure of images, similar to the perspective problem. In artificial intelligence and machine learning, GANs stand as a potent tool. Their versatility allows for their utilization across various tasks, showcasing distinct advantages over alternative generative models, including autoencoders, variational autoencoders, Boltzmann machines, PixelCNN, and generative flow models. GANs can also have negative effects, such as the generation of faces used in false profiles on social networks to commit criminal acts and the generation of false images and videos of real people without their permission. When GANs generate synthetic faces, they commonly focus solely on the facial region, often neglecting the rest of the body, and in some cases, specific models may only generate a mask of the face. Although these frontal views of the faces can be captured and utilized as synthetic image data, there are several challenges associated with GAN training, and the generated images may exhibit various flaws and imperfections. Moreover, GANs typically require a large amount of training data to achieve satisfactory results; limited availability or small dataset sizes can pose challenges and hinder the quality of generated outputs. GANs are also vulnerable to mode collapse, in which the generator fails to capture the desired diversity in the data distribution and produces a limited range of outputs, leading to a lack of variety in the generated images. Addressing these challenges is an active area of research, and advancements are being made to improve the stability, diversity, and overall quality of GAN-generated images [87].

3. Advancements in Generative Adversarial Network Research

GANs can be classified into three categories depending on the available training data and learning objectives: unsupervised, semi-supervised, and supervised learning approaches. To better understand the current state of GANs, we have grouped them by their learning approach. GANs are frequently employed in semi-supervised and unsupervised learning scenarios. In essence, GANs utilize a supervised learning approach to simulate unsupervised learning by generating synthetic data that closely resemble real data.

3.1. Unsupervised Generative Adversarial Networks

Unsupervised GANs, as the name suggests, are trained without any labeled data. They learn to generate samplings from an unknown data distribution by optimizing a generator network and a discriminator network that play a minimax game [88].
A GAN model converges when the discriminator cannot differentiate whether an image has been artificially generated. One of the main benefits of using GANs is that data generation can be automated. Generating synthetic human faces with a range of features and realism has facilitated the creation of artificial datasets for facial recognition, addressing the privacy concerns associated with real data [89]. Synthetic data have emerged as a viable alternative for acquiring large datasets, enabling researchers to generate diverse and representative samplings without compromising privacy. The performance of networks trained on synthetic facial recognition data has historically been poor: the domain gap between synthetic and real-world data has frequently been substantial, leading to sub-optimal performance when deploying models in real-world scenarios [68]. Recent advancements in GAN models, particularly StyleGAN2, have showcased significant progress in generating highly realistic and visually appealing human faces. StyleGAN2 has notably improved the visual quality and pixel resolution and minimized or eliminated artifacts in the generated faces, resulting in a more convincing and authentic appearance [23]. It is important to mention that the datasets used contain real human faces from the FFHQ dataset, which Nvidia released alongside its TensorFlow implementation [9,39]. This GAN face generation model has achieved the highest synthesis quality in generating realistic human faces: it can generate faces at a high resolution of 1024 × 1024 pixels, resulting in highly detailed and lifelike facial images [9]. A drawback of using this model is that it requires a large amount of computational power.
StyleGAN3 addresses a weakness of StyleGAN2 by tackling the texture sticking that occurred in the morphing transition between faces [9]. Positional references are available to the network in the intermediate layers when processing feature maps, coming from the following sources: per-pixel noise inputs, positional encodings, image borders, and aliasing (the hardest one to identify and fix) [10,23]. The main goal was to eliminate all positional references, and this goal was reached by making the network equivariant, meaning that an operation (e.g., ReLU) should not insert any positional references [11].
A substantial training dataset is required to achieve a diverse range of generated data. However, it is crucial to investigate how the proportions and quality of this original dataset impact the distribution of the resulting data samplings. The initial StyleGAN models utilize the FFHQ and CelebA-HQ datasets, which had relatively modest sizes of 70,000 and 30,000 samplings, respectively, and were trained at a resolution of 1024 × 1024. Aside from these original models, a few unofficial implementations of StyleGAN in other frameworks, such as PyTorch, propose models trained on alternative facial datasets with variable resolutions and quality; comparing these alternative models with the original StyleGAN is not straightforward due to their unofficial nature [23]. A research gap exists concerning the quantity of data and variation needed to train StyleGAN adequately and the association between the original training samplings and the generated samplings. The study in [50] addresses this gap by examining the relationship between the original training samplings, the generated samplings, and various aspects of StyleGAN: the authors trained StyleGAN on diverse face datasets with various resolutions to provide a valuable resource for researchers. This enables the exploration of how the size and quality of the original dataset affect the quality and distribution of generated data samplings and serves as a crucial step toward constructing small and scalable datasets of synthetic facial data. The study also investigates the practical training requirements of StyleGAN and the distinction between the authentic and synthetic data samplings. It is important to note that GANs can be adjusted to improve the rules for generating synthetic facial data, and the association between the training dataset and the generated data is crucial, as a more extensive training dataset leads to more significant variations in the output. StyleGAN, a widely employed deep learning model for generating high-quality images, has been trained on diverse subjects such as anime, cars, cats, bedrooms, and more.
StyleGAN has demonstrated linear and disentangled arrangement of its latent space, allowing for traversal directions that selectively modify specific image properties without affecting others. However, these editing capabilities are limited to the latent space of the StyleGAN model and apply only to images generated by the model itself. Recent research efforts have aimed to bridge the gap between the training and target domains by fine-tuning the StyleGAN generator and modifying its weights. This enables the utilization of the disentanglement properties of StyleGAN for assignments such as unsupervised segmentation mask generation. Other investigations have also explored the extraction of segmentation maps using the structure of StyleGAN. By training a StyleGAN model on a specific disentangled axis, researchers have shown that it is possible to overcome domain-related limitations and adapt the generator to suit specific requirements [72].
In [90], a unique application of StyleGAN is presented, diverging from the conventional focus found in other studies. The authors introduce a modified variant called StyleGAN-XL, which exhibits the ability to invert and manipulate images beyond the limited scope of portraits or specific object classes. This extended model demonstrates smooth interpolations between samplings belonging to different semantic categories. The primary objective of the study is to train a StyleGAN3 generator on the large-scale ImageNet dataset successfully. Success in this context is defined by the quality of generated samplings, primarily evaluated using the IS and the diversity measured by the FID. The contributions made by the authors enable the training of a significantly larger model compared to previous attempts while simultaneously reducing the computational requirements. Specifically, the proposed model exhibits three times greater depth and parameter count than a standard StyleGAN3 architecture. This hybrid model effectively combines the distinct semantic properties from both input domains, thus allowing for style mixing. While style mixing is commonly employed within a single domain, such as combining two human portraits, the authors expand its application to encompass cross-domain image manipulation and inversion.
The authors in [70] present a new method for detecting GAN-generated images based on analyzing the outcome of a GAN inversion process. The inversion procedure projects the image under investigation into the GAN's latent space and back into the image space, producing a reconstructed image that is compared with the original using similarity metrics. The experiments show that landmark-based metrics effectively capture the distinctive traits of synthetic images, and the detector is robust to typical post-processing. The main contributions include exploiting the generator's underlying mechanisms for detection, demonstrating that generative approaches have structural errors that can be revealed through such features, and extending the technique to any generator with an inversion procedure, reducing the need for retraining. The authors also deliver a corpus of face images and their reconstructions.
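The sketch below illustrates the general inversion-and-compare idea rather than the exact pipeline of [70]: a latent code is optimized so that a generator reproduces the image under analysis, and the residual error acts as a detection cue. The tiny generator is a placeholder; a real detector would invert the actual GAN and could additionally compare facial landmarks or perceptual distances.

```python
# Hedged sketch of detection by GAN inversion: optimize a latent code so the
# generator reproduces the image under analysis, then score the image by the
# reconstruction residual (low residual suggests the image lies on the GAN manifold).
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Placeholder generator; a real detector would invert the actual GAN."""
    def __init__(self, z_dim=64, img_pixels=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, img_pixels), nn.Tanh())

    def forward(self, z):
        return self.net(z).view(-1, 3, 32, 32)

def inversion_score(generator, image, z_dim=64, steps=200, lr=0.05):
    """Optimize a latent code to reproduce `image`; return the final residual."""
    z = torch.zeros(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((generator(z) - image) ** 2)  # pixel-level similarity metric
        loss.backward()
        opt.step()
    with torch.no_grad():
        return torch.mean((generator(z) - image) ** 2).item()

G = ToyGenerator()
test_image = torch.rand(1, 3, 32, 32) * 2 - 1   # stand-in for the image under analysis
print("reconstruction error:", inversion_score(G, test_image))
```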

3.2. Semi-Supervised Generative Adversarial Networks

Semi-supervised GANs represent a hybrid approach that integrates elements of both unsupervised and supervised GAN frameworks. These models leverage a limited set of labeled data alongside a substantial volume of unlabeled data to enhance the overall performance of the generator and discriminator components. Semi-supervised GANs aim to achieve improved generative and discriminative capabilities by incorporating this combination of labeled and unlabeled data.
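A minimal sketch of one common way to realize this idea is shown below: the discriminator predicts K real classes plus an extra "fake" class, so labeled, unlabeled, and generated batches each contribute a loss term. The class count and equal loss weighting are illustrative assumptions.

```python
# Sketch of a semi-supervised GAN discriminator objective with K + 1 output classes.
import torch
import torch.nn.functional as F

NUM_CLASSES = 10          # K real classes; index NUM_CLASSES is the "fake" class

def d_loss_semi_supervised(logits_labeled, labels, logits_unlabeled, logits_fake):
    # Supervised term: labeled real images must match their class.
    loss_sup = F.cross_entropy(logits_labeled, labels)
    # Unsupervised term: unlabeled real images should NOT fall into the fake class...
    p_fake_real = F.softmax(logits_unlabeled, dim=1)[:, NUM_CLASSES]
    loss_unsup_real = -torch.log(1.0 - p_fake_real + 1e-8).mean()
    # ...and generated images should be assigned to the fake class.
    fake_targets = torch.full((logits_fake.size(0),), NUM_CLASSES, dtype=torch.long)
    loss_unsup_fake = F.cross_entropy(logits_fake, fake_targets)
    return loss_sup + loss_unsup_real + loss_unsup_fake

# Toy usage with random logits of shape (batch, K + 1).
b = 8
print(d_loss_semi_supervised(torch.randn(b, NUM_CLASSES + 1),
                             torch.randint(0, NUM_CLASSES, (b,)),
                             torch.randn(b, NUM_CLASSES + 1),
                             torch.randn(b, NUM_CLASSES + 1)))
```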
The research presented in [91] introduces a novel model known as the identity preservation conditional generative adversarial network (IPCGAN), designed specifically for generating synthetic faces within specific age groups. The authors employed an image resolution of 227 × 227 pixels, optimizing computational efficiency. Preserving identity while generating new synthetic faces is a complex computational task. One of the critical challenges in GAN architectures is the large number of parameters and the high computational cost associated with training, necessitating a substantial amount of sample data to achieve satisfactory model performance. In the case of IPCGAN, as a baseline architecture, ample training time is required for convergence. Moreover, there is room for improvement in maintaining the identity of the input image. The weighted model employed serves as a basis for creating synthetic face images, with each input image generating multiple images corresponding to different age groups. Consequently, the architecture produces five hundred images representing five age groups derived from one hundred distinct identities. These generated images are then utilized in the evaluation process. Enhancing the age classification capability of the model contributes to its overall proficiency in creating realistic aging effects.
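The identity-preservation idea can be sketched as a feature-space penalty between the input face and the age-translated face, as below. The small convolutional feature extractor is only a placeholder for the pretrained face-feature network that an IPCGAN-style model would actually use.

```python
# Hedged sketch of an identity-preservation loss: penalize the distance between
# deep features of the input face and the generated (age-translated) face.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(          # placeholder for a pretrained face network
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())

def identity_preservation_loss(input_face, generated_face):
    """L2 distance between deep features of the source and generated faces."""
    f_in = feature_extractor(input_face)
    f_out = feature_extractor(generated_face)
    return torch.mean((f_in - f_out) ** 2)

src = torch.rand(4, 3, 227, 227)     # 227 x 227 resolution, as used in the paper
aged = torch.rand(4, 3, 227, 227)    # stand-in for the generator output
print(identity_preservation_loss(src, aged))
```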

3.3. Supervised Generative Adversarial Networks

Supervised GANs, on the other hand, are trained on labeled data and learn to generate samples that match the given labels. The generator and discriminator are both conditioned on the class labels during training.
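A minimal sketch of such label conditioning is shown below: both networks receive the class label through a learned embedding concatenated with their usual inputs. The layer sizes and the MNIST-like image shape are illustrative assumptions rather than details taken from any of the cited works.

```python
# Minimal sketch of label conditioning in a supervised/conditional GAN.
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    def __init__(self, z_dim=100, n_classes=10, img_pixels=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, img_pixels), nn.Tanh())

    def forward(self, z, labels):
        # Concatenate noise with the label embedding before synthesis.
        return self.net(torch.cat([z, self.embed(labels)], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self, n_classes=10, img_pixels=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(nn.Linear(img_pixels + n_classes, 256),
                                 nn.LeakyReLU(0.2), nn.Linear(256, 1))

    def forward(self, img_flat, labels):
        # The discriminator judges the image together with its claimed label.
        return self.net(torch.cat([img_flat, self.embed(labels)], dim=1))

z = torch.randn(8, 100)
y = torch.randint(0, 10, (8,))
fake = CondGenerator()(z, y)
print(CondDiscriminator()(fake, y).shape)   # (8, 1): one real/fake score per sample
```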
In [8], the authors concentrate on automating sexual facial expression recognition (SFER) to identify sexual expressions as macro facial expressions. Their aim is to detect erotic content in images, particularly aiding in cases of child sexual exploitation material (CSEM) where only facial images are available. They extend conditional generative adversarial networks (CGANs) to auxiliary classifier GANs (AC-GANs) for controlled image generation. The research introduces the Triple-BigGAN model to enhance facial expression recognition through conditional image synthesis. They introduce a new dataset, SEA-Faces-30k, due to the scarcity of SFER datasets. The experimental results show the ability of Triple-BigGAN to generate high-quality images that reduce the error rates of supervised-learning-based methods. Its performance is evaluated using FID and IS scores, surpassing state-of-the-art methods with 93.59% accuracy.
Additionally, the performance of Triple-BigGAN is compared with inference-based GANs on the MNIST, CIFAR-10, and SVHN datasets, achieving superior results. Ref. [5] validates the significance of FID over IS as an evaluation index for GANs. However, directly using FID presents challenges due to resource-intensive time and memory requirements.
Ref. [92] discusses a framework named InterFaceGAN that aims to comprehend and interpret the disentangled face representation learned by GANs in face synthesis. The authors study the properties of the facial semantics encoded in the latent space of the GANs and find that GANs learn diverse facial attributes in some linear subspaces of the latent space. They also manage to manipulate these attributes, including gender, age, expression, the presence of eyeglasses, and pose, and to fix artifacts in a photo-realistic way. The authors perform a detailed analysis of the relationship between attributes and editing results, applying the InterFaceGAN approach to the editing of real faces. The results indicate that GANs learn a controllable and disentangled face representation through face synthesis. They also explore the understanding of GANs and how to realistically manipulate facial attributes. However, they highlight the need to investigate the interpretation of GANs on more general objects and scenarios and to improve methods for unsupervised learning of semantics. Their work on InterFaceGAN represents an advance towards a broader interpretation of GAN models.
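The editing principle can be sketched as follows: a linear boundary separating an attribute is estimated in the latent space, and a code is shifted along the boundary normal. The randomly generated latents and labels below are stand-ins; in practice the labels come from an attribute classifier applied to the synthesized images, as in InterFaceGAN.

```python
# Hedged sketch of InterFaceGAN-style latent editing along a linear attribute boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
latents = rng.standard_normal((1000, 512))             # toy latent codes
# Toy attribute labels; real ones would come from an attribute classifier.
attribute = (latents[:, 0] + 0.1 * rng.standard_normal(1000) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(latents, attribute)
normal = clf.coef_[0] / np.linalg.norm(clf.coef_[0])    # unit normal of the boundary

def edit_latent(z, direction, strength):
    """Move a latent code along a semantic direction; other attributes should
    stay roughly unchanged if the representation is disentangled."""
    return z + strength * direction

z0 = rng.standard_normal(512)
z_more = edit_latent(z0, normal, strength=3.0)   # strengthen the attribute
z_less = edit_latent(z0, normal, strength=-3.0)  # weaken the attribute
print(clf.decision_function([z0, z_more, z_less]))
```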
The research discussed in [69] reveals that, even when an optimal set of generator parameters exists, GAN training often fails to discover it. This raises the question of whether non-adversarial generative models should be considered as potential replacements for GANs. Experimental findings indicate that training a mixture of GANs yields more promising outcomes than increasing the complexity of standalone networks that are already sufficiently complex for modeling multi-modal data. Moreover, current GAN models such as BigGAN, CR-BigGAN, and LOGAN require significant computational resources, using a number of TPU cores comparable to the image height or width in pixels. As a result, many researchers lack the resources to conduct large-scale experiments of this nature. However, with increased computing power, it is anticipated that large mixtures of GANs could be trained on datasets like ImageNet 128 × 128, leading to further advancements in the field.
Ref. [93] proposes a new hybrid generative model, called dual distribution matching GAN (DDMGAN), that merges an AE and a GAN. The authors aim to address issues with GANs, such as training instability and mode collapse, by performing distribution matching in both the data and latent representation spaces. The low-dimensional latent representation space is obtained by training an AE. The authors perform empirical evaluations on different datasets to demonstrate the efficacy of DDMGAN in stabilizing the training process and increasing the mode coverage of the GAN. The article mentions various GAN variants proposed to address these problems, including changes to network architectures, optimization procedures, alternative probability distances, and regularizations on the discriminator. It also discusses autoencoders (AEs), which are used to learn low-dimensional latent data representations and to generate samples from these representations. AE-based generative models are less likely to suffer from training instability or mode collapse compared to GANs. Thus, AEs have been used in various ways to improve GANs, such as employing AEs as discriminators, augmenting GANs with encoders for inference, or directly combining AEs with GANs. The DDMGAN model simultaneously performs distribution matching in the data and latent representation spaces, and the authors conducted theoretical analysis and comprehensive empirical evaluations on various datasets to verify its efficacy in improving GANs.
There are alternatives, such as reducing the number of high-quality image samples without degrading the quality of the results. The work in [71] shows that, even after removing 30% of the high-quality images from the training set, its method can still synthesize better-quality images, as measured by IS and FID on the CIFAR-10 dataset. All GAN methods require high-quality images but face heavy demands on computing resources and time, whereas gathering low-quality images is easier and cheaper. The method, called conditional transferring features, improves the quality of image generation and the scalability of the GAN to more object classes [71].
Ref. [94] presents a suitable framework for heavy rain subtraction and super-resolution that uses an interpretable IDM-based network and facial attention mechanism of FCGAL. The interpretable IDM-based network is designed for physics-based heavy rain subtraction, and FCGAL improves facial structure expressions by enabling facial attention and learning local facial authenticity examiners. Additionally, the study introduces novel training and test datasets designed explicitly for low-resolution heavy rain face images. These datasets utilize CelebA-HQ, which consists of clean facial images. The source code for generating these images is publicly available for further research. Researchers can utilize these datasets as a benchmark and measure evaluation scores to compare the performance of different methods. The proposed network demonstrates the capability to effectively remove heavy rain, enhance resolution, and improve image visibility. The experimental results indicate that it outperforms state-of-the-art methods and exhibits superior performance for joint heavy rain removal and super-resolution compared to conventional approaches such as image-to-image translation, heavy rain removal, and super-resolution models.
Ref. [74] aims to provide an effective and efficient method for transferring facial movement from a source video to a single image, resulting in a new video that imitates the source. Despite progress in facial image animation, generating believable facial movements remains challenging in computer graphics. This model uses GANs with a motion transfer model to distinguish foreground from background and to make facial transformations such as translation, rotation, and gaze shift. Compared to prior methods, it is a new approach focusing only on manipulating facial expressions. The network generates realistic synthetic video frames for a target image using synthetic input from a parametric face model and precise image manipulation. Adversarial training is used to improve accuracy in the post-processing conversion. The proposed technique provides more coherent and visually high-quality videos, leading to better-aligned landmark sequences for training.
The research presented in [7] introduces a unified framework called Priv-FairGAN (PF-GAN) that aims to enforce image privacy and fairness protections simultaneously. Using contrastive learning, the framework combines state-of-the-art GAN models with de-identification constraints, similarity measurements, data balancing, and fair pre-trained weights. PF-GAN can effectively remove private information, ensure fairness, and address data and model utility concerns by integrating these components. Experimental evaluations conducted on the CelebA-HD dataset demonstrate the effectiveness of the framework in providing fair privacy and fairness protections while achieving high prediction accuracy. Future work in this area involves extending the proposed methods to other computer vision tasks, such as segmentation, and further enhancing fairness protections to minimize performance disparities between generated and real images. Additionally, the authors trained a GAN model using a training set of over 10,000 male or female facial images. The generated facial images were then utilized for training a smile classification task, employing a MobileNet model with a resolution of 224 × 224. The trained model was evaluated on a test set of 800 images (400 male and 400 female) from the larger dataset of 8000 images used for training. This research also explores image segmentation as an alternative approach to reducing flaws in each image.
The current state of the art in GANs has significantly advanced synthetic image generation, especially with the architecture known as StyleGAN. This architecture has demonstrated its ability to generate high-quality and realistic images, overcoming many previous limitations in image generation. However, despite the impressive results obtained, there are still challenges and limitations in generating images that resemble real people without noticeable flaws and defects. One of the critical areas where these limitations are evident is in the blending of facial features. Despite advances, there are still difficulties in achieving an accurate and realistic representation of the facial features of a person in the generated images. Facial feature blending involves the seamless and coherent combination of facial features like eyes, nose, mouth, and other distinguishing features. The challenge lies in generating images where these features blend naturally and organically without producing noticeable visual artifacts or distortions.
Table 3 summarizes various research papers in the field of GANs, including the metrics utilized to estimate the performance of the models, the datasets used for training and testing, and the specific architectures employed. The table illustrates the diverse range of metrics used to assess the quality of GANs, highlighting the wide variation in the field. The results clarify that numerous quality standards exist due to the diversity of datasets and variants in GAN architectures.
One of the factors contributing to this limitation is the intrinsic complexity of facial features and their variability in the human population. Human faces possess a wide range of unique features and subtleties, such as the shape of the eyes, the contour of the nose, and the structure of the lips, among other details. The accurate generation of these features and their appropriate blending requires an advanced understanding and modeling of facial variability and structure. In addition, the representation of skin texture and other microscopic details also presents significant challenges. The texture of human skin is highly complex and varies across different regions of the face, with features such as pores, wrinkles, and subtle patterns. Realistic imaging must capture these details in their proper context and achieve a natural and believable appearance. Current GAN methods have addressed these limitations by using latent coding models and incorporating style and structure techniques in image generation. These techniques allow for greater control and manipulation of facial features, improving the quality and realism of the generated images. However, there is still room for improvement in consistency and fidelity in the representation of complex facial features and their seamless blending. Analysis of the existing literature provides a clear insight into the use of image datasets in GANs. Many researchers fail to adequately filter and segment their datasets when using GANs. This is due to excessive confidence in the capacity of these artificial intelligence models, which, although they produce outstanding results in some cases, in others still fail to meet expectations. A prominent example of this phenomenon is the StyleGAN architecture, which has been shown to achieve impressive results in generating synthetic images. However, it is also notable that this architecture can generate images with significant defects. This suggests that the image generation process is largely left to chance, without paying enough attention to the quality of the data provided to the GAN.
The review of existing articles supports this observation, since there is evidence of insufficient care in the quality of the data used to train GANs. The proper selection and preparation of datasets is essential to obtain accurate and high-quality results in image generation. However, many researchers overlook this crucial step and trust the ability of the GAN to produce satisfactory results. This lack of attention to data quality can have several consequences. On the one hand, it can result in generated images with visible defects, such as distortions or undesirable artifacts. This compromises the usefulness and realism of the generated images. Furthermore, it can affect coherence and consistency in image generation, limiting its applicability in various contexts and tasks. One possible explanation for this trend is the availability of large, diverse datasets containing low-quality images or unwanted features, which can lead to the inadvertent acceptance of these data into the GAN training process. In addition, the inherent complexity and variability of image data can make them difficult to clean and segment properly, leading to less attention to these critical aspects. It is essential to recognize the importance of carefully cleaning and segmenting the datasets used with GANs. This involves the removal of low-quality images, noise, or unwanted data, as well as precise segmentation of regions of interest. By paying attention to the quality of the data and ensuring that they are representative and relevant to the specific task, the ability of the GAN to generate high-quality and realistic images can be significantly improved.
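As a simple illustration of this kind of pre-filtering, the sketch below discards unreadable, undersized, or blurry images before training. The resolution and sharpness thresholds are assumptions chosen for illustration, not values used in any of the reviewed works.

```python
# Minimal pre-filtering sketch: drop corrupt, small, or blurry images before GAN training.
import cv2
from pathlib import Path

MIN_SIDE = 256          # assumed minimum resolution for keeping facial detail
BLUR_THRESHOLD = 100.0  # assumed variance-of-Laplacian cutoff for blur

def is_acceptable(path):
    img = cv2.imread(str(path))
    if img is None:
        return False                        # unreadable or corrupt file
    h, w = img.shape[:2]
    if min(h, w) < MIN_SIDE:
        return False                        # too small to preserve facial detail
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness >= BLUR_THRESHOLD      # low variance indicates a blurry image

dataset_dir = Path("raw_faces")             # hypothetical input folder
kept = [p for p in dataset_dir.glob("*.png") if is_acceptable(p)]
print(f"kept {len(kept)} images after quality filtering")
```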

4. Experiments

The FFHQ-Aging dataset is used to perform the experiments [40]. FFHQ-Aging is an extension of the original FFHQ that adds labels to each image for gender, age group, head pose, the presence of glasses, and other attributes, allowing the images to be segmented into subsets and used in the experiments.
To carry out the experiments, the FFHQ dataset was used in its 256 × 256 pixel version, since this resolution is large enough to appreciate details and nuances in people's faces while remaining moderate enough to allow faster training than generating images at 1024 × 1024 pixels.
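Segmenting such a labeled dataset into training subsets can be sketched as a simple filter over the label file, as below. The CSV file name and column names are assumptions made for illustration and should be checked against the actual FFHQ-Aging metadata.

```python
# Hedged sketch of segmenting a labeled face dataset (e.g., FFHQ-Aging) into a subset.
import pandas as pd

labels = pd.read_csv("ffhq_aging_labels.csv")   # hypothetical label file

subset = labels[
    (labels["age_group"] == "30-39") &          # one age segment per training run
    (labels["glasses"] == "NoGlasses")          # reduce facial variability further
]
print(f"{len(subset)} images selected for this training run")
subset["image_id"].to_csv("subset_30_39.txt", index=False, header=False)
```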
Table 4 presents a comparison of the results obtained when retraining with the data segmented by age group, together with the size of each subset. The FID values obtained differ considerably from those reported in the original paper, where the FID achieved is 2.84.
Figure 6 shows a collage of images generated by StyleGAN2 and StyleGAN3, respectively. This example highlights an important point in the evaluation of generative models: although the FID values obtained by each model may be very good and close to zero, these values alone do not guarantee the perceptual quality of the generated images. An FID close to zero indicates that the feature distributions of the original and synthetic datasets are very similar to each other. However, human perception of image quality is not based solely on statistical similarity to real data. It is important to note that the quality of a generated image can be affected by a number of factors, such as visual artifacts, lack of consistency in the image, or the presence of details that appear inconsistent. In the case of the images generated by StyleGAN2 and StyleGAN3, although the FID values may be excellent, visual perception reveals that there are still defects perceptible to the human eye. This underlines the importance of not relying solely on evaluation metrics, such as FID, to judge the quality of the generated images. Human assessment remains critical in determining the actual quality of images and their fidelity.
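For reference, the FID compares the means and covariances of deep feature distributions: FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2)). The sketch below shows this calculation on random feature vectors that stand in for the Inception-v3 activations used in practice.

```python
# Sketch of the FID computation on feature vectors (random stand-ins for Inception features).
import numpy as np
from scipy.linalg import sqrtm

def fid(features_real, features_fake):
    mu_r, mu_f = features_real.mean(0), features_fake.mean(0)
    cov_r = np.cov(features_real, rowvar=False)
    cov_f = np.cov(features_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):        # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 64))
fake = rng.standard_normal((500, 64)) + 0.1   # slightly shifted distribution
print("FID:", fid(real, fake))
```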
However, if the data are carefully segmented by age and part of the statistical distribution is truncated to remove outliers, the FID values obtained are higher, but the results are more accurate; that is, the images show no defects or imperfections and, to the human eye, qualify as real.
A positive trend is observed in all age segments as more training images are provided. Throughout the experiments, a threshold of approximately 10,000 images was observed as the minimum required to achieve FID values below 10. In particular, once the image count exceeds 10,000, a direct correlation with the FID becomes apparent.
The preceding experiments show that, with this approach, the generated images are devoid of blemishes or defects. This underscores the importance of understanding the available data. In addition, models of this nature require substantial volumes of images.
In Figure 7, it can be seen that better results are achieved in terms of quality and realistic appearance, because defects have been reduced by a large percentage and there is greater control over what is generated. Although the FID values are higher than those presented in the original articles, the improved quality makes the results more applicable to real-world problems.
It is important to note that this is not a failure of the metric itself; rather, averaging over the complete distribution yields a lower FID because most of the data lie in the central age segments, where the number of images exceeds 10,000. When the data are segmented, the values obtained are higher for this reason, but there is greater control over defects.

5. Conclusions and Future Work

GANs are a potent tool in machine learning, allowing us to generate synthetic data almost indistinguishable from the real thing. The competition between the generator and the discriminator networks creates a powerful feedback loop that leads to increasingly realistic synthetic data. GANs are a testament to the power of deep learning and will undoubtedly be a driving force in developing new AI technologies for years to come. The synthetic faces produced by StyleGAN will allow this technology to be used for various purposes. A very innovative application is to use these images for undercover agents in different social networks. The quality of the synthetic images produced is best evaluated using a combination of metrics such as FID, IS, and LPIPS, which show the greatest coherence with the generated images and, in turn, align most closely with what a human being would rate as good quality. Correct segmentation and preprocessing of the dataset must always be performed, because feeding unpreprocessed data to the model is a common shortcoming in many investigations. Proper training guidance and careful attention to the results are crucial to achieving reliable and accurate results. By recognizing these limitations and taking the necessary precautions, researchers can ensure the proper use of GANs in AI technology and avoid potentially costly mistakes.
One notable challenge is the notorious difficulty in training GANs. The generator and discriminator networks constantly compete against each other, leading to unstable and slow training processes. The literature analysis indicates that many researchers fail to filter and segment datasets correctly when using GANs. This lack of attention to data quality can harm image generation, limiting its usefulness and realism. It is crucial to recognize the importance of carefully selecting and preparing datasets to obtain optimal results when imaging with GANs.
Improving the generation of images of real people without defects and noticeable flaws requires decreasing facial variability, for example, by segmenting the dataset into people with and without glasses, by age segment, and by gender, as well as applying more advanced techniques to represent skin texture and other microscopic details. With this, larger and more diverse datasets are expected to contribute to better representation and more realistic images.

Future Work

Therefore, it is imperative to focus on improving the quality of synthetic images generated by StyleGAN to create images that are indistinguishable from real photos of people. In achieving this, we can enhance the accuracy and reliability of these models, making them more useful in various applications. Future challenges include:
  • Define a way to use the most representative metrics. This is important because different metrics have different strengths and weaknesses, and it is crucial to select those that can effectively capture the quality and diversity of the generated images. In addition, combining multiple metrics, such as FID, IS, and LPIPS, can provide a more comprehensive evaluation of GAN models (a minimal illustration of such a combined score is sketched after this list).
  • Implement a new way to support the existing metrics with a VAE. This approach could improve the quality of synthetic images by incorporating additional information from a VAE, which can learn a better latent representation of the images and improve the diversity of the generated images.
  • Define a standard dataset or the minimum features of a dataset like proper segmentation, resolution, etc. This can establish a benchmark for evaluating GAN models and enable more meaningful comparisons between different models. Having a standardized dataset with segmentation could also help to identify the strengths and weaknesses of different models in generating specific features of the images.
  • Implement a framework that allows us to find the best way to evaluate synthetic images. This could involve exploring different combinations of metrics and comparing their performance on a standardized dataset. It could also include investigating the effectiveness of different preprocessing methods, such as data augmentation or normalization, in improving the quality of synthetic images.
  • Defining a standard of minimum features and guidelines for creating a balanced dataset could also reduce the computational power used by GANs, which requires large amounts of data to generate synthetic data. Using a standardized dataset with balanced features, GANs may require less computational power to produce accurate results, which could reduce the costs and resources required for training and using these models.
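As referenced in the first item above, the following sketch shows one minimal way such a combined report could look: precomputed FID, IS, and LPIPS values are normalized and merged into a single weighted score. The weights, normalization ranges, and the IS and LPIPS values in the example are illustrative assumptions only; the FID of 7.73 is taken from the 30–39 segment in Table 4.

```python
# Minimal sketch of aggregating several metrics into one summary score.
def combined_quality_score(fid_value, is_value, lpips_value, weights=(0.5, 0.3, 0.2)):
    # Lower is better for FID and LPIPS; higher is better for IS.
    fid_term = max(0.0, 1.0 - fid_value / 50.0)     # assumption: FID of 50 ~ unusable
    is_term = min(1.0, is_value / 10.0)             # assumption: IS of 10 ~ excellent
    lpips_term = max(0.0, 1.0 - lpips_value)        # LPIPS distances lie roughly in [0, 1]
    w_fid, w_is, w_lpips = weights
    return w_fid * fid_term + w_is * is_term + w_lpips * lpips_term

# Example: the segmented 30-39 model (FID 7.73 from Table 4) with assumed IS and
# LPIPS values, purely to illustrate the aggregation.
print(combined_quality_score(fid_value=7.73, is_value=4.2, lpips_value=0.35))
```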

Author Contributions

Conceptualization, C.N.V.M., A.L.S.O. and L.J.G.V.; methodology, C.N.V.M., A.L.S.O. and L.J.G.V.; validation, C.N.V.M., A.L.S.O. and L.J.G.V.; investigation, C.N.V.M., A.L.S.O. and L.J.G.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was also supported by the European Commission under the Horizon 2020 research and innovation programme, as part of the project HEROES (Grant Agreement no. 101021801). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or European Commission—EU. Neither the European Union nor the European Commission can be held responsible for them.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable. This study does not report any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. arXiv 2014. [Google Scholar] [CrossRef]
  2. Van Den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1747–1756. [Google Scholar]
  3. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  4. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018—Conference Track Proceedings, Vancouver, BC, Canada, 30 April–3 May 2017. [Google Scholar]
  5. Kim, C.I.; Kim, M.; Jung, S.; Hwang, E. Simplified Fréchet Distance for Generative Adversarial Nets. Sensors 2020, 20, 1548. [Google Scholar] [CrossRef]
  6. Fu, J.; Li, S.; Jiang, Y.; Lin, K.Y.; Qian, C.; Loy, C.C.; Wu, W.; Liu, Z. StyleGAN-Human: A Data-Centric Odyssey of Human Generation. In Computer Vision—ECCV 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; pp. 1–19. [Google Scholar]
  7. Tian, H.; Zhu, T.; Zhou, W. Fairness and privacy preservation for facial images: GAN-based methods. Comput. Secur. 2022, 122, 102902. [Google Scholar] [CrossRef]
  8. Gangwar, A.; González-Castro, V.; Alegre, E.; Fidalgo, E. Triple-BigGAN: Semi-supervised generative adversarial networks for image synthesis and classification on sexual facial expression recognition. Neurocomputing 2023, 528, 200–216. [Google Scholar] [CrossRef]
  9. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
  10. Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training Generative Adversarial Networks with Limited Data. Adv. Neural Inf. Process. Syst. 2020, 33, 12104–12114. [Google Scholar]
  11. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
  12. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  13. Deng, J.; Pang, G.; Zhang, Z.; Pang, Z.; Yang, H.; Yang, G. cGAN Based Facial Expression Recognition for Human-Robot Interaction. IEEE Access 2019, 7, 9848–9859. [Google Scholar] [CrossRef]
  14. Zhao, Z.; Singh, S.; Lee, H.; Zhang, Z.; Odena, A.; Zhang, H. Improved Consistency Regularization for GANs. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; pp. 11033–11041. [Google Scholar]
  15. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016—Conference Track Proceedings, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  16. Zhou, R.; Jiang, C.; Xu, Q. A survey on generative adversarial network-based text-to-image synthesis. Neurocomputing 2021, 451, 316–336. [Google Scholar] [CrossRef]
  17. Zhu, J.; Yang, G.; Lio, P. How can we make GAN perform better in single medical image super-resolution? A lesion focused multi-scale approach. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; pp. 1669–1673. [Google Scholar]
  18. Gong, Y.; Liao, P.; Zhang, X.; Zhang, L.; Chen, G.; Zhu, K.; Tan, X.; Lv, Z. Enlighten-GAN for Super Resolution Reconstruction in Mid-Resolution Remote Sensing Images. Remote. Sens. 2021, 13, 1104. [Google Scholar] [CrossRef]
  19. Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 8–11 August 2017; pp. 2642–2651. [Google Scholar]
  20. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802. [Google Scholar]
  21. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  22. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
  23. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
  24. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of wasserstein GANs. arXiv 2017. [Google Scholar] [CrossRef]
  25. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2849–2857. [Google Scholar]
  26. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1060–1069. [Google Scholar]
  27. You, S.; Lei, B.; Wang, S.; Chui, C.K.; Cheung, A.C.; Liu, Y.; Gan, M.; Wu, G.; Shen, Y. Fine Perceptive GANs for Brain MR Image Super-Resolution in Wavelet Domain. IEEE Trans. Neural Netw. Learn. Syst. 2022. [Google Scholar] [CrossRef] [PubMed]
  28. Kazeminia, S.; Baur, C.; Kuijper, A.; van Ginneken, B.; Navab, N.; Albarqouni, S.; Mukhopadhyay, A. GANs for medical image analysis. Artif. Intell. Med. 2020, 109, 101938. [Google Scholar] [CrossRef] [PubMed]
  29. Lata, K.; Dave, M.; Nishanth, K.N. Image-to-Image Translation Using Generative Adversarial Network. In Proceedings of the 2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 12–14 June 2019. [Google Scholar]
  30. Skandarani, Y.; Jodoin, P.M.; Lalande, A. GANs for Medical Image Synthesis: An Empirical Study. J. Imaging 2023, 9, 69. [Google Scholar] [CrossRef] [PubMed]
  31. Li, R.; Wang, N.; Feng, F.; Zhang, G.; Wang, X. Exploring Global and Local Linguistic Representations for Text-to-Image Synthesis. IEEE Trans. Multimed. 2020, 22, 3075–3087. [Google Scholar] [CrossRef]
  32. Vougioukas, K.; Petridis, S.; Pantic, M. End-to-End Speech-Driven Facial Animation with Temporal GANs. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018. [Google Scholar]
  33. Saidia Fascí, L.; Fisichella, M.; Lax, G.; Qian, C. Disarming visualization-based approaches in malware detection systems. Comput. Secur. 2023, 126, 103062. [Google Scholar] [CrossRef]
  34. Perera, A.; Khayatian, F.; Eggimann, S.; Orehounig, K.; Halgamuge, S. Quantifying the climate and human-system-driven uncertainties in energy planning by using GANs. Appl. Energy 2022, 328, 120169. [Google Scholar] [CrossRef]
  35. Min, J.; Liu, Z.; Wang, L.; Li, D.; Zhang, M.; Huang, Y. Music Generation System for Adversarial Training Based on Deep Learning. Processes 2022, 10, 2515. [Google Scholar] [CrossRef]
  36. Sisman, B.; Vijayan, K.; Dong, M.; Li, H. SINGAN: Singing Voice Conversion with Generative Adversarial Networks. In Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 18–21 November 2019. [Google Scholar]
  37. Wen, S.; Liu, W.; Yang, Y.; Huang, T.; Zeng, Z. Generating Realistic Videos From Keyframes With Concatenated GANs. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2337–2348. [Google Scholar] [CrossRef]
  38. Lucas, A.; López-Tapia, S.; Molina, R.; Katsaggelos, A.K. Generative Adversarial Networks and Perceptual Losses for Video Super-Resolution. IEEE Trans. Image Process. 2019, 28, 3312–3327. [Google Scholar] [CrossRef]
  39. NVlabs. GitHub–NVlabs/ffhq-dataset: Flickr-Faces-HQ Dataset (FFHQ). 2018. Available online: https://github.com/NVlabs/ffhq-dataset (accessed on 24 June 2023).
  40. Hsu, G.S.; Xie, R.C.; Chen, Z.T.; Lin, Y.H. AgeTransGAN for Facial Age Transformation with Rectified Performance Metrics. In Computer Vision—ECCV 2022; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2022; Volume 13672, pp. 580–595. [Google Scholar]
  41. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  42. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  43. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  44. Yu, F.; Seff, A.; Zhang, Y.; Song, S.; Funkhouser, T.; Xiao, J. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. arXiv 2015, arXiv:1506.03365. [Google Scholar]
  45. Hukkelås, H.; Mester, R.; Lindseth, F. DeepPrivacy: A Generative Adversarial Network for Face Anonymization. In ISVC 2019; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2019; pp. 565–578. [Google Scholar]
  46. Lucic, M.; Kurach, K.; Michalski, M.; Gelly, S.; Bousquet, O. Are GANs Created Equal? A Large-Scale Study. arXiv 2018. [Google Scholar] [CrossRef]
  47. Shmelkov, K.; Schmid, C.; Alahari, K. How Good Is My GAN? In ECCV2018; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2018; pp. 218–234. [Google Scholar]
  48. Kurach, K.; Lučić, M.; Zhai, X.; Michalski, M.; Gelly, S. A Large-Scale Study on Regularization and Normalization in GANs. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3581–3590. [Google Scholar]
  49. Matchev, K.; Roman, A.; Shyamsundar, P. Uncertainties associated with GAN-generated datasets in high energy physics. SciPost Phys. 2022, 12, 104. [Google Scholar] [CrossRef]
  50. Varkarakis, V.; Bazrafkan, S.; Corcoran, P. Re-Training StyleGAN-A First Step towards Building Large, Scalable Synthetic Facial Datasets. In Proceedings of the 2020 31st Irish Signals and Systems Conference, ISSC 2020, Letterkenny, Ireland, 11–12 June 2020. [Google Scholar]
  51. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A dataset for recognising faces across pose and age. In Proceedings of the 13th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2018, Xi’an, China, 15–19 May 2018; pp. 67–74. [Google Scholar]
  52. Bansal, A.; Nanduri, A.; Castillo, C.D.; Ranjan, R.; Chellappa, R. UMDFaces: An annotated face dataset for training deep networks. In Proceedings of the IEEE International Joint Conference on Biometrics, IJCB 2017, Denver, CO, USA, 1–4 October 2017; pp. 464–473. [Google Scholar]
  53. Gross, R.; Matthews, I.; Cohn, J.; Kanade, T.; Baker, S. Multi-PIE. In Proceedings of the 2008 8th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2008, Amsterdam, The Netherlands, 17–19 September 2008. [Google Scholar]
  54. Chen, B.C.; Chen, C.S.; Hsu, W.H. Cross-age reference coding for age-invariant face recognition and retrieval. In Computer Vision—ECCV 2014; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2014; Volume 8694, pp. 768–783. [Google Scholar]
  55. Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning face representation from scratch. arXiv 2014, arXiv:1411.7923. [Google Scholar]
  56. Obukhov, A.; Krasnyanskiy, M. Quality Assessment Method for GAN Based on Modified Metrics Inception Score and Fréchet Inception Distance. Adv. Intell. Syst. Comput. 2020, 1294, 102–114. [Google Scholar]
  57. Borji, A. Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 2019, 179, 41–65. [Google Scholar] [CrossRef]
  58. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  59. Sajjadi, M.S.M.; Mario, B.; Google, L.; Olivier, B.; Sylvain, B.; Brain, G.G. Assessing Generative Models via Precision and Recall. arXiv 2018. [Google Scholar] [CrossRef]
  60. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  61. Binkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying MMD GANs. arXiv 2018. [Google Scholar] [CrossRef]
  62. Naeem, M.F.; Oh, S.J.; Uh, Y.; Choi, Y.; Yoo, J. Reliable Fidelity and Diversity Metrics for Generative Models. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 7176–7185. [Google Scholar]
  63. Yates, M.; Hart, G.; Houghton, R.; Torres, M.T.; Pound, M. Evaluation of synthetic aerial imagery using unconditional generative adversarial networks. ISPRS J. Photogramm. Remote. Sens. 2022, 190, 231–251. [Google Scholar] [CrossRef]
  64. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Smola, A.; Schölkopf, B. A Kernel Two-Sample Test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  65. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv. 2017. [Google Scholar] [CrossRef]
  66. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X.; Chen, X. Improved Techniques for Training GANs. arXiv 2016. [Google Scholar] [CrossRef]
  67. Krizhevsky, A.; Nair, V.; Hinton, G. The CIFAR-10 Dataset. 2014. Available online: http://www.cs.Toronto.edu/kriz/cifar.html (accessed on 24 June 2023).
  68. Li, H.; Li, B.; Tan, S.; Huang, J. Identification of deep network generated images using disparities in color components. Signal Process. 2020, 174, 107616. [Google Scholar] [CrossRef]
  69. Tang, S. Lessons Learned from the Training of GANs on Artificial Datasets. IEEE Access 2020, 8, 165044–165055. [Google Scholar] [CrossRef]
  70. Pasquini, C.; Laiti, F.; Lobba, D.; Ambrosi, G.; Boato, G.; Natale, F.D. Identifying Synthetic Faces through GAN Inversion and Biometric Traits Analysis. Appl. Sci. 2023, 13, 816. [Google Scholar] [CrossRef]
  71. Wu, C.; Li, H. Conditional Transferring Features: Scaling GANs to Thousands of Classes with 30% Less High-Quality Data for Training. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  72. Bermano, A.; Gal, R.; Alaluf, Y.; Mokady, R.; Nitzan, Y.; Tov, O.; Patashnik, O.; Cohen-Or, D. State-of-the-Art in the Architecture, Methods and Applications of StyleGAN. Comput. Graph. Forum 2022, 41, 591–611. [Google Scholar] [CrossRef]
  73. Yazıcı, Y.; Foo, C.S.; Winkler, S.; Yap, K.H.; Piliouras, G.; Chandrasekhar, V. The Unusual Effectiveness of Averaging in GAN Training. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  74. Karim, A.A.; Saleh, S.M. Face Image Animation with Adversarial Learning and Motion Transfer. Int. J. Interact. Mob. Technol. (iJIM) 2022, 16, 109–121. [Google Scholar] [CrossRef]
  75. Dhirani, L.L.; Mukhtiar, N.; Chowdhry, B.S.; Newe, T. Ethical Dilemmas and Privacy Issues in Emerging Technologies: A Review. Sensors 2023, 23, 1151. [Google Scholar] [CrossRef]
  76. Voigt, P. The EU General Data Protection Regulation (GDPR): A Practical Guide (Article 32). GDPR 2018, 10, 10–5555. [Google Scholar]
  77. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar]
  78. Kora Venu, S.; Ravula, S. Evaluation of Deep Convolutional Generative Adversarial Networks for Data Augmentation of Chest X-ray Images. Future Internet 2021, 13, 8. [Google Scholar] [CrossRef]
  79. Feng, Q.; Guo, C.; Benitez-Quiroz, F.; Martinez, A.M. When do gans replicate? On the choice of dataset size. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6701–6710. [Google Scholar]
  80. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef]
  81. Fei, H.; Tan, F. Bidirectional Grid Long Short-Term Memory (BiGridLSTM): A Method to Address Context-Sensitivity and Vanishing Gradient. Algorithms 2018, 11, 172. [Google Scholar] [CrossRef]
  82. Zhang, Z.; Li, M.; Yu, J. On the Convergence and Mode Collapse of GAN. In Proceedings of the SIGGRAPH Asia 2018 Technical Briefs, New York, NY, USA, 4–7 December 2018. [Google Scholar]
  83. Grnarova, P.; Levy, K.Y.; Lucchi, A.; Perraudin, N.; Goodfellow, I.; Hofmann, T.; Krause, A. A Domain Agnostic Measure for Monitoring and Evaluating GANs. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  84. Xu, Q.; Huang, G.; Yuan, Y.; Guo, C.; Sun, Y.; Wu, F.; Weinberger, K. An empirical study on evaluation metrics of generative adversarial networks. arXiv 2018, arXiv:1806.07755. [Google Scholar]
  85. Alfarra, M.; Pérez, J.C.; Frühstück, A.; Torr, P.H.S.; Wonka, P.; Ghanem, B. On the Robustness of Quality Measures for GANs. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; pp. 18–33. [Google Scholar]
  86. Alaluf, Y.; Patashnik, O.; Cohen-Or, D. ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6711–6720. [Google Scholar]
  87. Aggarwal, A.; Mittal, M.; Battineni, G. Generative adversarial network: An overview of theory and applications. Int. J. Inf. Manag. Data Insights 2021, 1, 100004. [Google Scholar] [CrossRef]
  88. Yu, Y.; Li, X.; Liu, F. Attention GANs: Unsupervised Deep Feature Learning for Aerial Scene Classification. IEEE Trans. Geosci. Remote. Sensing 2020, 58, 519–531. [Google Scholar] [CrossRef]
  89. Tan, W.R.; Chan, C.S.; Aguirre, H.E.; Tanaka, K. ArtGAN: Artwork synthesis with conditional categorical GANs. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017. [Google Scholar]
  90. Sauer, A.; Schwarz, K.; Geiger, A. Stylegan-xl: Scaling StyleGAN to large diverse datasets. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10. [Google Scholar]
  91. Pranoto, H.; Heryadi, Y.; Warnars, H.L.H.S.; Budiharto, W. Enhanced IPCGAN-Alexnet model for new face image generating on age target. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 7236–7246. [Google Scholar] [CrossRef]
  92. Shen, Y.; Yang, C.; Tang, X.; Zhou, B. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2004–2018. [Google Scholar] [CrossRef]
  93. Zuo, Z.; Zhao, L.; Li, A.; Wang, Z.; Chen, H.; Xing, W.; Lu, D. Dual distribution matching GAN. Neurocomputing 2022, 478, 37–48. [Google Scholar] [CrossRef]
  94. Son, C.H.; Jeong, D.H. Heavy Rain Face Image Restoration: Integrating Physical Degradation Model and Facial Component-Guided Adversarial Learning. Sensors 2022, 22, 5359. [Google Scholar] [CrossRef] [PubMed]
  95. Li, Y.; Gan, Z.; Shen, Y.; Liu, J.; Cheng, Y.; Wu, Y.; Carin, L.; Carlson, D.; Gao, J. Storygan: A sequential conditional gan for story visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6329–6338. [Google Scholar]
  96. Zhang, M.; Ling, Q. Supervised Pixel-Wise GAN for Face Super-Resolution. IEEE Trans. Multimed. 2021, 23, 1938–1950. [Google Scholar] [CrossRef]
  97. Yao, X.; Newson, A.; Gousseau, Y.; Hellier, P. A Style-Based GAN Encoder for High Fidelity Reconstruction of Images and Videos. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 581–597. [Google Scholar]
Figure 1. GAN architecture variants.
Figure 2. The frequency of the main architectures of the literature review.
Figure 3. A general overview of application domains of GANs.
Figure 4. The frequency of main datasets in the literature review.
Figure 5. The frequency of main metrics in the literature review.
Figure 6. A selection of synthetic images produced using StyleGAN2 with an FID of 2.84 and StyleGAN3 with an FID of 2.79, exhibiting inherent imperfections or flaws.
Figure 7. The first row corresponds to the 30–39 segment, the second row to 40–49, and finally to 50–69 years of age.
Table 1. Features of datasets used.
Dataset—Ref. | Number of Images | Res. (px) | Format | Classes
CelebA [43] | 202,599 | 178 × 218 | JPG | There are no categorical labels, but there are 40 binary attributes describing facial features.
FFHQ [9] | 70,000 | 1024 × 1024 | PNG | There are no categorical labels, but there are 10 continuous attributes describing facial features, such as age, gender, pose, expression, and others.
LSUN [44] | 10,000,000 | VR | JPG | There are several categories, including bedrooms, churches, kitchens, living rooms, and buildings.
VGGFace2 [51] | 3,321,053 | VR | JPG | There are over 8000 famous and non-famous people classes.
UMDFaces [52] | 367,888 | VR | JPG | There are several categories, including age, gender, race, and emotions.
Multi-PIE [53] | 750,000 | 640 × 490 | BMP | There are several categories, including lighting, poses, and facial expressions.
CACD [54] | 163,446 | VR | JPG | Age of people in the images.
CASIA-WebFace [55] | 494,414 | VR | JPG | There are over 10,000 classes of famous and non-famous people.
LFW [50] | 13,233 | VR | JPG | There are over 5000 classes of famous and non-famous people.
Table 2. The intersection between the top 3 architectures, datasets, and metrics.
References | Architecture | CIFAR-10 | CelebA | FFHQ | FID | IS | SSIM | PSNR
[8,14,68,69]BigGAN✓ 1  
[50,63,68,70]StyleGAN2   
[6,71,72]StyleGAN   
[48,68]DCGAN  
[56,73,74]GAN  
[17,47,69]WGAN-GP  
1 The main GAN architectures that use the main datasets and metrics found in the literature review carried out.
Table 3. Succinct outline of the metrics, datasets, and architectures employed.
Ref. | Architecture | Datasets | Metrics
[73] | GAN | ImageNet, CIFAR 10, STL 10 | FID, IS, MA, EMA
[47] | WGAN-GP | CIFAR 10, CIFAR 100, ImageNet, MNIST | FID, IS, GAN-Train, GAN-Test
[48] | DCGAN | CIFAR 10, CelebA-HQ, LSUN Bedroom | FID, IS, KID, MS-SSIM
[95] | StoryGAN | CIFAR 10, CelebA-HQ, LSUN Bedroom | MS-SSIM
[17] | WGAN, WPGAN, MSGAN | BraTS | SSIM, PSNR
[56] | GAN | MNIST, HAR, EST | FID, IS
[13] | WGAN, cGAN | AffectNet, RAF-DB | W-DISTANCE
[69] | WGAN-GP, BigGAN | CIFAR 10 | FID, IS, W-DISTANCE
[68] | DCGAN, WGAN, ProGAN, StyleGAN2, BigGAN, CoCoGAN | FFHQ, CelebA, LSUN | RGB, HSV, YCbCr
[50] | StyleGAN2 | CelebA, CASIA WebFace | FID, IS
[71] | StyleGAN | CIFAR 10, STL 10, ImageNet, CASIA HWDB1.0 | FID, IS
[14] | VanillaGAN, SNDCGAN, BigGAN | CIFAR 10 | FID
[18] | Enlighten-GAN | Sentinel-2 | GSM, PI, PSNR, LPIPS
[96] | SPGAN | VGGFace2, CelebA, Helen, LFW, CFP-FP, AgeDB-30 | FID, PSNR, SSIM
[63] | PCGAN, StyleGAN2, CoCoGAN | Inria Aerial | FID, KID
[7] | PF-GAN | Celeb-HD | Similarity Score
[74] | GAN | VoxCeleb | PSNR, SSIM
[6] | StyleGAN | SHHQ | FID
[94] | FCG-GAN | CelebA | PSNR, SSIM
[90] | StyleGAN-XL | FFHQ, Pokemon | FID, IS
[93] | DDM-GAN | MNIST | FID, KL-divergence
[97] | StyleGAN3 | CelebA-HQ | FID, LPIPS, SSIM, PSNR, MSE
[70] | StyleGAN2 | CelebA, LFW, FFHQ, Caltech | LPIPS, MSE
[8] | Triple-BigGAN, BIG | SEA Faces | LPIPS, MSE
[72] | StyleGAN | FFHQ | FID, IS, LPIPS, MS-SSIM
[49] | InterFaceGAN | CelebA | KL-divergence
[92] | PCGAN, InterFaceGAN | CelebA-HD | ACRD, ACSD, SBC
[40] | AgeTransGAN | FFHQ-Aging | FID
Table 4. Comparison of training sessions carried out with different age segments.
Segment | Size | Average FID ↓ 2 | Min FID | ±% FID
0–2 | 2492 | 27.64 | 14.81 | 0
70+ | 1812 | 22.30 | 15.47 | 19.32
10–14 | 2235 | 19.94 | 16.09 | 10.61
3–6 | 4523 | 18.85 | 13.00 | 5.46
7–9 | 2858 | 18.30 | 14.51 | 2.90
15–19 | 4022 | 13.51 | 11.89 | 26.18
50–69 | 7726 | 9.62 | 6.83 | 28.79
40–49 | 9678 | 8.58 | 7.02 | 10.80
20–29 | 19,511 | 8.13 | 6.56 | 5.25
30–39 | 15,143 | 7.73 | 6.35 | 4.89
2 Ordered by FID from worst to best.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
