
Deep Learning-based Face Super-resolution: A Survey

Published: 23 November 2021

Abstract

Face super-resolution (FSR), also known as face hallucination, aims to enhance the resolution of low-resolution (LR) face images to generate high-resolution (HR) face images; it is a domain-specific image super-resolution problem. Recently, FSR has received considerable attention and witnessed dazzling advances with the development of deep learning techniques. To date, few summaries of the studies on deep learning-based FSR are available. In this survey, we present a comprehensive review of deep learning-based FSR methods in a systematic manner. First, we summarize the problem formulation of FSR and introduce popular assessment metrics and loss functions. Second, we elaborate on the facial characteristics and popular datasets used in FSR. Third, we categorize existing methods according to their utilization of facial characteristics. In each category, we start with a general description of the design principles, present an overview of representative approaches, and then discuss their pros and cons. Fourth, we evaluate the performance of some state-of-the-art methods. Fifth, methods that jointly perform FSR and other tasks, together with FSR-related applications, are briefly introduced. Finally, we envision the prospects of further technological advancement in this field.

1 Introduction

Face super-resolution (FSR), a domain-specific image super-resolution problem, refers to the technique of recovering high-resolution (HR) face images from low-resolution (LR) face images. It can increase the resolution of a low-quality LR face image and recover the lost details. In many real-world scenarios, limited by physical imaging systems and imaging conditions, face images are often of low quality. Thus, with a wide range of applications and notable advantages, FSR has been a hot topic in image processing and computer vision since its birth.
The concept of FSR was first proposed in 2000 by Baker and Kanade [8], the pioneers of the FSR technique, who develop a multi-level learning and prediction model based on the Gaussian image pyramid to improve the resolution of an LR face image. Liu et al. [9] propose to integrate a global parametric principal component analysis (PCA) model with a local nonparametric Markov random field model for FSR. Since then, a number of innovative methods have been proposed, and FSR has become the subject of active research efforts. Researchers super-resolve LR face images by means of global face statistical models [10, 11, 12, 13, 14, 15, 16], local patch-based representation methods [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], or hybrid ones [28, 29]. These methods have achieved good performance, but they struggle to meet practical requirements. With the rapid development of deep learning techniques, methods with attractive advantages over previous attempts have been applied to image and video super-resolution, and many comprehensive surveys have reviewed recent achievements in these fields, i.e., general image super-resolution surveys [30, 31, 32] and a video super-resolution survey [33]. For FSR, a domain-specific image super-resolution problem, the few existing surveys are listed in Table 1. In the early stage of research, References [1, 2, 3, 4, 5, 6] provide comprehensive reviews of traditional FSR methods (mainly including patch-based super-resolution, PCA-based methods, etc.), while Liu et al. [7] offer a generative adversarial network (GAN)-based FSR survey. However, so far no literature review is available on deep learning super-resolution specifically for human faces. In this article, we present a comparative study of different deep learning-based FSR methods.
| No. | Survey title | Year | Venue |
|-----|--------------|------|-------|
| 1 | A survey of face hallucination [1] | 2012 | CCBR |
| 2 | A comprehensive survey to face hallucination [2] | 2014 | IJCV |
| 3 | A review of various approaches to face hallucination [3] | 2015 | ICACTA |
| 4 | Face super resolution: a survey [4] | 2017 | IJIGSP |
| 5 | Super-resolution for biometrics: a comprehensive survey [5] | 2018 | PR |
| 6 | Face hallucination techniques: a survey [6] | 2018 | CICT |
| 7 | Survey on GAN-based face hallucination with its model development [7] | 2019 | IET |

Table 1. Summary of Face Super-resolution Surveys since 2010
The main contributions of this survey are as follows:
The survey provides a comprehensive review of recent techniques for FSR, including the problem definition, commonly used evaluation metrics and loss functions, the characteristics of face images, benchmark datasets, deep learning-based FSR methods, a performance comparison of state-of-the-art methods, methods that jointly perform FSR and other tasks, and FSR-related applications.
The survey summarizes how existing deep learning-based FSR methods explore the potential of network architectures and take advantage of the characteristics of face images, and compares the similarities and differences among these methods.
The survey discusses the challenges and envisions the prospects of future research in the FSR field.
In the following, we cover the existing deep learning-based FSR methods; Figure 1 shows the taxonomy of FSR. Section 2 introduces the problem definition of FSR and commonly used assessment metrics and loss functions. Section 3 presents the facial characteristics (i.e., prior information, attribute information, and identity information) and reviews some mainstream face datasets. Section 4 discusses FSR methods. To avoid exhaustive enumeration and to take facial characteristics into consideration, FSR methods are categorized according to the facial characteristics they use, yielding five major categories: general FSR methods, prior-guided FSR methods, attribute-constrained FSR methods, identity-preserving FSR methods, and reference FSR methods. Depending on the network architecture or the utilization of facial characteristics, every category is further divided into several subcategories. Section 4 also compares the performance of some state-of-the-art methods and reviews methods dealing with joint tasks as well as FSR-related applications. Section 5 concludes the survey, discusses the limitations, and envisions the prospects of further technological advancement.
Fig. 1. The taxonomy of face super-resolution.

2 Background

2.1 Problem Definition

FSR focuses on recovering the corresponding HR face image from an observed LR face image. The image degradation model can be mathematically written as
$$I_{LR} = \mathcal{D}(I_{HR}; \theta_d), \tag{1}$$
where $\mathcal{D}$ is the degradation model, $\theta_d$ represents the model parameters including the blurring kernel, downsampling operation, and noise, $I_{LR}$ is the observed LR face image, and $I_{HR}$ is the original HR face image. FSR is devoted to simulating the inverse process of the degradation model and recovering $I_{SR}$ from $I_{LR}$, which can be expressed as
$$I_{SR} = \mathcal{F}(I_{LR}; \theta_f), \tag{2}$$
where $\mathcal{F}$ is the super-resolution model (inverse degradation model), $\theta_f$ represents the parameters of $\mathcal{F}$, and $I_{SR}$ represents the super-resolved result. The optimization of $\theta_f$ can be defined as
$$\hat{\theta}_f = \arg\min_{\theta_f} \mathcal{L}(I_{SR}, I_{HR}), \tag{3}$$
where $\mathcal{L}(I_{SR}, I_{HR})$ represents the loss between $I_{SR}$ and $I_{HR}$ and $\hat{\theta}_f$ is the optimal parameter of the trained model. In FSR, mean square error (MSE) loss and $L_1$ loss are the most popular loss functions, and some models tend to use a combination of multiple loss functions, which will be reviewed in Section 2.2.
The degradation model $\mathcal{D}$ and its parameters $\theta_d$ are unavailable in a real-world environment, and $I_{LR}$ is the only given information. To simulate the image degradation process, researchers tend to use mathematical models to generate LR and HR pairs to train the model. The simplest mathematical model is
$$I_{LR} = I_{HR}\!\downarrow_s, \tag{4}$$
where $\downarrow$ denotes the downsampling operation and $s$ is the scaling factor. However, this pattern is too simple to match the real-world degradation process. To better mimic the real degradation process, researchers design a degradation process with the combination of many operations (e.g., downsampling, blur, noise, and compression) as follows:
$$I_{LR} = J\big((I_{HR} \otimes k)\!\downarrow_s +\, n\big), \tag{5}$$
where $k$ is the blurring kernel, $\otimes$ represents the convolutional operation, $n$ denotes the noise, and $J$ denotes the image compression. Various combinations of different operations are used in FSR, including the widely used bicubic model [34, 35, 36] as well as the general degradation model used for blind FSR [37, 38, 39]; they are not introduced in detail in this survey.
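To make the degradation pipeline of Equation (5) concrete, the following minimal sketch generates an LR face image from an HR one with OpenCV and NumPy; the kernel width, noise level, JPEG quality, and function name are illustrative choices rather than values prescribed by any particular FSR paper.

```python
import cv2
import numpy as np

def degrade(hr, scale=8, kernel_sigma=1.6, noise_sigma=5.0, jpeg_quality=70):
    """I_LR = J((I_HR * k) downarrow_s + n): blur, downsample, add noise, compress."""
    blurred = cv2.GaussianBlur(hr, (0, 0), sigmaX=kernel_sigma)           # I_HR * k
    lr = cv2.resize(blurred, (hr.shape[1] // scale, hr.shape[0] // scale),
                    interpolation=cv2.INTER_CUBIC)                         # downsample by s
    noisy = lr.astype(np.float64) + np.random.normal(0, noise_sigma, lr.shape)  # + n
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)
    ok, buf = cv2.imencode(".jpg", noisy, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])  # J(.)
    return cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)
```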

2.2 Assessment Metrics and Loss Functions

In deep learning-based FSR methods, the loss function, which measures the difference between $I_{SR}$ and $I_{HR}$, plays an important role in guiding the network training. Upon acquiring the trained network, the reconstruction performance can be evaluated by assessment metrics. Different loss functions have different preferences: for example, $L_2$ loss tends to produce results faithful to the original image (high Peak Signal-to-Noise Ratio (PSNR) values), while perceptual and adversarial losses generate subjectively pleasing results (low Fréchet Inception Distance (FID) [40] and Learned Perceptual Image Patch Similarity (LPIPS) [41] values). In practice, we can choose the appropriate loss function according to our needs. Considering the relationship between loss functions and assessment metrics, we introduce them together in this section.

2.2.1 Image Quality Assessment.

Generally, the two main methods of quality evaluation are subjective and objective evaluation. Subjective evaluation relies on human judgement: readers or interviewers are invited to view and assess the quality of the generated images, leading to results consistent with human perception but time-consuming, inconvenient, and expensive. In contrast, objective evaluation mainly utilizes statistical measures to reflect the quality of the generated images. In general, objective evaluation methods often produce results different from subjective evaluation, because their starting point is mathematics instead of human visual perception, which leaves the assessment of image quality in dispute. Here, we introduce some popular assessment metrics.
PSNR: PSNR is a commonly used objective assessment metric in FSR. Given $I_{SR}$ and $I_{HR}$, the MSE between them is first calculated, and then the PSNR is obtained:
$$\mathrm{MSE} = \frac{1}{hwc}\sum_{i=1}^{hwc}\big(I_{SR}(i) - I_{HR}(i)\big)^2, \tag{6}$$
$$\mathrm{PSNR} = 10\log_{10}\frac{M^2}{\mathrm{MSE}}, \tag{7}$$
where $h$, $w$, and $c$ denote the height, width, and channel of the image and $M$ is the maximum possible pixel value (i.e., 255 for 8-bit images). The smaller the pixelwise difference between the two images, the higher the PSNR. In this pattern, PSNR focuses on the distance between every pair of pixels in two images, which is inconsistent with human perception, resulting in poor performance when human perception is more important.
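The following is a direct implementation of Equations (6) and (7) for 8-bit images ($M = 255$); the function name is ours.

```python
import numpy as np

def psnr(sr, hr, max_val=255.0):
    """PSNR between two images of identical shape (Equations (6) and (7))."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```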
Structural Similarity Index (SSIM): SSIM [42] is also a popular objective assessment metric that measures the structural similarity between two images from three aspects: luminance, contrast, and structure. Given $I_{SR}$ and $I_{HR}$, SSIM is obtained by
$$\mathrm{SSIM}(I_{SR}, I_{HR}) = l(I_{SR}, I_{HR})\, c(I_{SR}, I_{HR})\, s(I_{SR}, I_{HR}), \tag{8}$$
where $l(\cdot,\cdot)$, $c(\cdot,\cdot)$, and $s(\cdot,\cdot)$ denote the similarity of the luminance, contrast, and structure, respectively. SSIM varies from 0 to 1: the higher the structural similarity of the two images, the larger the SSIM. Considering the uneven distribution of image statistics, single-scale SSIM is not reliable enough. Thus, the multi-scale structural similarity index measure (MS-SSIM) [43] is proposed, which divides the image into multiple windows, first assesses SSIM for every window separately, and then aggregates the results to obtain MS-SSIM.
LPIPS: LPIPS [41] measures the distance between two images in a deep feature space. LPIPS is more in line with human judgement than PSNR and SSIM. The more similar the two images, the smaller the LPIPS.
FID: In contrast to PSNR and SSIM, FID [40] focuses on the difference between and in a distribution-wise manner, and it is always applied to assess the visual quality of face images. The better the visual quality, the smaller the FID.
Natural Image Quality Evaluator (NIQE): NIQE [44] is a no-reference metric that requires no ground truth image: it measures the distance between two multivariate Gaussian models, one fitted to natural images and one fitted to the evaluated image. Specifically, the fitting of the multivariate Gaussian model is based on quality-aware features derived from a natural scene statistics model. The better the visual quality, the smaller the NIQE.
Mean Opinion Score (MOS): MOS is a commonly used subjective assessment metric, in contrast to the above objective quantitative metrics. To obtain the MOS, human raters are asked to assign perceptual quality scores to the tested images, and the MOS is the arithmetic mean of the ratings. When the number of human raters is small, the MOS can be biased; when the number of raters is large, it is faithful enough.

2.2.2 Loss Functions.

Initially, pixelwise $L_2$ loss (also known as MSE loss) is popular; however, researchers then find that models based on $L_2$ loss tend to generate over-smooth results. Thus, many other kinds of loss functions are employed, such as pixelwise $L_1$ loss, SSIM loss, perceptual loss, adversarial loss, and so on.
Pixelwise Loss: Pixelwise loss measures the distance between two images at the pixel level, including $L_1$ loss, which calculates the mean absolute error; $L_2$ loss, which calculates the mean square error; Huber loss [45]; and the Charbonnier penalty function [46]. Under the constraint of pixelwise loss, the obtained $I_{SR}$ can be close enough to $I_{HR}$ in pixel values. By definition, $L_2$ loss is sensitive to large errors but indifferent to small errors, while $L_1$ loss treats them equally; therefore, $L_1$ loss has advantages in improving performance and convergence over $L_2$ loss. Overall, pixelwise loss can force the model to improve PSNR, but the generated images are often over-smooth and lack high-frequency details.
SSIM Loss: Similar to pixelwise loss, SSIM loss is designed to improve the structural similarity between the super-resolved image and the original HR one:
$$\mathcal{L}_{SSIM} = 1 - \mathrm{SSIM}(I_{SR}, I_{HR}), \tag{9}$$
where $\mathrm{SSIM}(\cdot,\cdot)$ denotes the function of SSIM. Beyond single-scale SSIM loss, multi-scale SSIM loss calculates the SSIM loss at different scales.
Perceptual Loss: To improve the perceptual quality, one solution is to minimize the perceptual loss
$$\mathcal{L}_{per} = \big\|\phi_l(I_{SR}) - \phi_l(I_{HR})\big\|, \tag{10}$$
where $\phi$ is a pretrained network (e.g., VGG [47]) and $l$ indexes its $l$th layer. In essence, the perceptual loss measures the distance between the features extracted from $\phi$, and it can evaluate the difference at the semantic level. Perceptual loss encourages the network to generate $I_{SR}$ that is more perceptually similar to $I_{HR}$. The $I_{SR}$ predicted by a model with perceptual loss usually looks more pleasant but has lower PSNR than those of pixelwise loss-based methods.
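A minimal PyTorch sketch of Equation (10) follows, using features from a pretrained VGG-16 as $\phi_l$; the layer cut (relu3_3) and the squared-error distance are common but illustrative choices, and inputs are assumed to be ImageNet-normalized.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Slice of VGG-16 up to relu3_3 serves as the fixed feature extractor phi_l
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16]
        for p in vgg.parameters():
            p.requires_grad = False  # phi stays frozen; only the SR model trains
        self.phi = vgg.eval()

    def forward(self, sr, hr):
        # Distance between deep features of I_SR and I_HR (Equation (10))
        return torch.mean((self.phi(sr) - self.phi(hr)) ** 2)
```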
Adversarial Loss: Adversarial loss, proposed with GAN [48], is also widely used in FSR. A GAN is composed of two models: a generator (G) and a discriminator (D). In FSR, G is the super-resolution model that generates the super-resolved face with an LR face image as input, and D discriminates whether the output result is generated or real. In the training phase, G and D are trained alternately. Early methods [34, 49] use cross entropy-based adversarial losses expressed as
$$\mathcal{L}_{D} = -\mathbb{E}\big[\log D(I_{HR})\big] - \mathbb{E}\big[\log\big(1 - D(I_{SR})\big)\big], \tag{11}$$
$$\mathcal{L}_{G} = -\mathbb{E}\big[\log D(I_{SR})\big], \tag{12}$$
where $\mathcal{L}_{D}$ and $\mathcal{L}_{G}$ denote the loss functions of D and G, respectively, $D(\cdot)$ denotes the function of D, and $I_{HR}$ is randomly sampled from the HR training samples. However, models trained with this adversarial loss are often unstable and may suffer from mode collapse. Therefore, Wasserstein GAN [50] and WGAN-GP [51] are proposed to alleviate the training difficulties. A model trained with adversarial loss tends to introduce artificial details, leading to worse PSNR and SSIM but pleasing visual quality with smaller FID.
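A sketch of the cross entropy-based losses in Equations (11) and (12), written with binary cross-entropy on discriminator logits; `disc` is any discriminator network producing one logit per image.

```python
import torch
import torch.nn.functional as F

def d_loss(disc, hr, sr):
    real = disc(hr)           # D should score real HR faces as 1 (Equation (11))
    fake = disc(sr.detach())  # and super-resolved faces as 0
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def g_loss(disc, sr):
    fake = disc(sr)           # G tries to make D score I_SR as real (Equation (12))
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```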
Cycle Consistency Loss: Cycle consistency loss is proposed in CycleGAN [52]. In CycleGAN-based FSR, two cooperating models are used: a super-resolution model super-resolves $I_{LR}$ to recover $I_{SR}$, and a degradation model downsamples $I_{SR}$ back to an LR image $\hat{I}_{LR}$. In turn, the degradation model downsamples the HR face image to obtain an LR image, and the super-resolution model then recovers it to generate $\hat{I}_{HR}$. The cycle consistency loss is aimed at keeping the consistency between ($I_{LR}$, $\hat{I}_{LR}$) and ($I_{HR}$, $\hat{I}_{HR}$):
$$\mathcal{L}_{cyc} = \big\|\hat{I}_{LR} - I_{LR}\big\| + \big\|\hat{I}_{HR} - I_{HR}\big\|. \tag{13}$$
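Expressed in code, the cycle of Equation (13) reads as follows; `sr_net` and `deg_net` stand in for the super-resolution and degradation models described above, and the $L_1$ distance is one common instantiation of the norm.

```python
def cycle_loss(sr_net, deg_net, lr, hr):
    """Cycle consistency (Equation (13)) with L1 distances, on torch tensors."""
    lr_cycle = deg_net(sr_net(lr))   # LR -> SR -> back to LR
    hr_cycle = sr_net(deg_net(hr))   # HR -> LR -> back to HR
    return (lr_cycle - lr).abs().mean() + (hr_cycle - hr).abs().mean()
```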
In addition to the above loss functions, many other loss functions are also used in FSR, including style loss [53], feature match loss [54], and so on. Due to the limitation of space, we do not introduce them in detail.

3 Characteristics of Face Images

The human face is a highly structured object with its own unique characteristics, which can be explored and utilized in the FSR task. In this section, we briefly introduce these facial characteristics.

3.1 Prior Information

As shown in Figure 2, structural priors can be found in face images, such as facial landmarks, facial heatmaps, and facial parsing maps.
Fig. 2. Facial characteristics.
Facial landmarks: These locate the key points of facial components. The number of landmarks varies across datasets; for example, CelebA [55] provides five landmarks, while Helen [56] offers 194.
Facial heatmaps: These are generated from facial landmarks. Facial landmarks give the exact locations of the facial components, while heatmaps give the probability of each point being a facial landmark. To generate the heatmaps, every landmark is represented by a Gaussian kernel centered on its location (a minimal sketch of this rasterization is given at the end of this subsection).
Facial parsing maps: These are semantic segmentation maps that separate the facial components of a face image, including eyes, nose, mouth, skin, ears, hair, and others.
This face structure prior information can provide the locations of facial components and facial structure information. We can expect to recover more reasonable target face images if we incorporate such prior knowledge to regularize or guide FSR models.
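The heatmap generation just described can be sketched as follows: each landmark becomes a Gaussian bump centered at its coordinates. The sigma value and the function name are illustrative.

```python
import numpy as np

def landmarks_to_heatmaps(landmarks, h, w, sigma=2.0):
    """landmarks: (N, 2) array of (x, y) points -> (N, h, w) heatmaps."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(landmarks), h, w), dtype=np.float32)
    for i, (x, y) in enumerate(landmarks):
        # Gaussian kernel centered on the landmark location
        maps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```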

3.2 Attribute Information

Second, attributes, such as gender, hair color, and others, are affiliated features of face images and can be seen as semantic-level information. In FSR, because of the one-to-many mapping from LR images to HR ones, the recovered face image may contain artifacts and even wrong attributes; for example, the face in the recovered result may not wear eyeglasses although the ground truth does. In such cases, attribute information can remind the network which attributes should be covered in the result. From a different perspective, attribute information also encodes facial details; taking eyeglasses as an example, the attribute of wearing eyeglasses provides details about the eye region. We provide a concise example of attribute information in Figure 2. Moreover, these attributes are usually binary in face datasets: 1 denotes that the face image has the attribute, while 0 means it does not.

3.3 Identity Information

Third, every face image corresponds to a person, which is captured by identity information. This type of information is often used for keeping the identity consistent between the super-resolved result and the ground truth. On the one hand, the person should not change visually after super-resolution; on the other hand, FSR should facilitate the performance of face recognition. Similar to attribute information, identity information offers high-level constraints to the FSR task and is beneficial to face restoration.

3.4 Datasets for FSR

In recent years, many face image datasets have been used for FSR, differing in many aspects, e.g., the number of samples and the facial characteristics provided. In Table 2, we list a number of commonly used face image datasets and indicate their size and the facial characteristics they offer. For parsing maps and identity, we only indicate whether they are provided, while for attributes and landmarks, we give the specific amount. Aside from these datasets, many other face datasets are used in FSR, including CACD2000 [66], VGGFace2 [67], UMDFaces [68], CASIA-WebFace [69], and others. It is worth noting that all the above-mentioned datasets only provide HR face images. To use them for training and evaluating a super-resolution model, we need to generate the corresponding LR face images using the degradation models introduced in Section 2.
| Dataset | Number | #Attributes | #Landmarks | Parsing maps | Identity |
|---------|--------|-------------|------------|--------------|----------|
| CelebA [55] | 202,599 | 40 | 5 | × | ✓ |
| CelebAMask-HQ [57] | 30,000 | × | × | ✓ | × |
| Helen [56] | 2,330 | × | 194 | ✓ | × |
| FFHQ [58] | 70,000 | × | 68 | × | × |
| AFLW [59] | 25,993 | × | 21 | × | × |
| 300W [60] | 3,837 | × | 68 | × | × |
| LS3D-W [61] | 230,000 | × | 68 | × | × |
| Menpo [62] | 9,000 | × | 68 | × | × |
| LFW [63] | 13,233 | 73 | × | × | ✓ |
| LFWA [64] | 13,233 | 40 | × | × | ✓ |
| VGGFace [65] | 3,310,000 | × | × | × | ✓ |

Table 2. Summary of Public Face Image Datasets for FSR

4 FSR Methods

At present, various deep learning-based FSR methods have been proposed. On the one hand, some methods tap the potential of efficient network architectures for FSR regardless of facial characteristics, i.e., developing a basic convolutional neural network (CNN) or GAN for face reconstruction. On the other hand, some approaches focus on the utilization of facial characteristics, e.g., using structure prior information to facilitate face restoration. Furthermore, some recently proposed models introduce additional high-quality reference face images to assist the restoration. According to the type of face-specific information used, we divide FSR methods into five categories: general FSR, prior-guided FSR, attribute-constrained FSR, identity-preserving FSR, and reference FSR. In this section, we concentrate on each kind of FSR method and introduce each category in detail.

4.1 General FSR

General FSR methods mainly focus on designing an efficient network and exploiting the potential of the network structure for FSR without using any facial characteristics. In the early days, most of these methods are based on CNNs and incorporate various advanced architectures (including back projection, residual networks, spatial or channel attention, and so on) to improve the representation ability of the network; many FSR methods using such advanced networks have since been proposed. We divide general FSR methods into four categories: basic CNN-based methods, GAN-based methods, reinforcement learning-based methods, and ensemble learning-based methods. To present a clear and concise overview, we summarize the general FSR methods in Figure 3.
Fig. 3. Overview of general FSR methods.

4.1.1 Basic CNN-based Methods.

Inspired by the pioneering deep learning-based general image super-resolution method [70], researchers also propose to apply CNNs to the FSR task. Depending on whether they consider global information and local differences, we can further divide basic CNN-based methods into three categories: global methods, which feed the entire face into the network and recover face images globally; local methods, which divide face images into different components and then recover them; and mixed methods, which recover face images both locally and globally.
Global Methods: In the early years, researchers treat a face image as a whole and recover it globally. Inspired by the strong representative ability of CNNs, bi-channel convolutional neural networks [71, 72] directly learn a mapping from LR face images to HR ones. Then, benefiting from the performance gain of iterative back projection (IBP) in general image super-resolution, Huang et al. [73] introduce IBP to FSR as an extra post-processing step, developing the super-resolution using deep convolutional networks (SRCNN)-IBP method. After that, the idea of back projection is widely used in FSR [74, 75]. Later, channel and spatial attention mechanisms greatly improve general image super-resolution methods, which inspires researchers to explore their utilization in FSR, and a number of innovative methods integrating attention mechanisms are proposed [76, 77, 78]. Among these works, two representative methods are E-ComSupResNet [77], which introduces a channel attention mechanism, and SPARNet [78], which has a well-designed spatial attention mechanism for FSR. Besides that, many researchers design cascaded models and exploit multi-scale information to improve the restoration performance [79, 80, 81].
It is observed that super-resolution in the image domain tends to produce smooth results without high-frequency details. Considering that the wavelet transform can represent the textural and contextual information of images, WaSRNet [82, 83] transforms face images into wavelet coefficients and super-resolves the face images in the wavelet coefficient domain to avoid over-smoothed results.
Local Methods: Global methods can capture global information but cannot recover face details well. Thus, local methods are developed to recover different parts of a face image differently. The super-resolution technique based on definition-scalable inference (SRDSI) [84] decomposes the face into a low-frequency basic face and a high-frequency compensation face through PCA. Then, SRDSI recovers the basic face with a very deep convolutional network (VDSR) [85] and the compensation face with sparse representation, and finally fuses the two recovered faces. After that, many patch-based methods have been proposed [86, 87, 88], all of which divide face images into several patches and train models for recovering the corresponding patches.
Mixed Methods: Considering that global methods capture the global structure but ignore local details, while local methods focus on local details but lose the global structure, a line of research naturally combines global and local methods to capture the global structure and recover local details simultaneously. At first, global-local networks [89, 90] develop a global upsampling network to model global constraints and a local enhancement network to learn face-specific details. To simultaneously capture global clues and recover local details, the dual-path deep fusion network [91] constructs two individual branches for learning global facial contours and local facial component details, and then fuses the results of the two branches to generate the final SR result.

4.1.2 GAN-based Methods.

Compared with CNN-based methods, which utilize pixelwise losses and generate smooth face images, GANs, first proposed by Goodfellow et al. [48], can be applied to generate realistic-looking face images with more details, which inspires researchers to design GAN-based methods. At first, researchers focus on designing various GANs to learn from paired or unpaired data. In recent years, utilizing a pretrained generative model to boost FSR has attracted increasing attention. Therefore, GAN-based methods can be divided into general GAN-based methods and generative prior-based methods.
General GAN-based Methods: In the early stage, Yu et al. [34] develop ultra-resolving face images by discriminative generative networks (URDGN), which consists of two subnetworks: a discriminative model that distinguishes a real HR face image from an artificially super-resolved output, and a generative model that generates SR face images to fool the discriminative model and match the distribution of HR face images. MLGE [92] not only designs discriminators to distinguish face images but also applies edge maps of the face images to reconstruct HR face images. Recently, HiFaceGAN [93] and the works in References [94, 95, 96, 97] also super-resolve face images with generative models. Instead of directly feeding the whole face image into the discriminator, PCA-SRGAN [98] decomposes face images into components by PCA and progressively feeds increasing components of the face images into the discriminator to reduce its learning difficulty. The commonality of these GANs is that the discriminator outputs a single probability value to characterize whether the result is a real face image. However, Zhang et al. [99] argue that a single probability value is too fragile to represent a whole image; thus, they design a supervised pixelwise GAN (SPGAN) whose discriminator outputs a discriminative matrix with the same resolution as the input images, together with a supervised pixelwise adversarial loss, thus recovering more photo-realistic face images.
The above methods rely on artificial LR and HR pairs generated by a known degradation. However, the quality of a real-world LR image is affected by a wide range of factors, such as the imaging conditions and the imaging system, leading to complicated unknown degradations of real LR images. The gap between real LR images and artificial ones is large and inevitably decreases the performance when methods trained on artificial pairs are applied to real LR images [100]. To address this problem, real-world super-resolution [101] first estimates the degradation parameters of real LR faces, such as the blur kernel, noise, and compression, and then generates LR and HR face image pairs with the estimated parameters for training the model.
LRGAN [102] proposes to learn the degradation before super-resolution from unpaired data. It designs a high-to-low GAN to learn the real degradation process from unpaired LR and HR face images and to create paired LR and HR face images for training a low-to-high GAN. Specifically, with HR face images as input, the high-to-low GAN generates LR face images (GLRs) that should belong to the real LR distribution and be close to the corresponding downsampled HR face images. Then, for the low-to-high GAN, GLRs are fed into the generator to recover SR results that must be close to the HR face images and match the real HR distribution. Goswami et al. [103] further develop a robust FSR method, and Zheng et al. [104] utilize semi-dual optimal transport to guide model learning, developing a semi-dual optimal transport CycleGAN. Considering that discrepancies between GLRs in the training phase and real LR face images in the testing phase still exist, researchers introduce characteristic regularization (CR) [105]. Different from LRGAN, CR transforms real LR face images into artificial LR ones and then conducts super-resolution reconstruction in the artificial LR space. Based on CycleGAN, CR learns the mapping between real LR face images and artificial ones; it then uses the artificial LR face images generated from real ones to fine-tune the super-resolution model, which is pretrained on the artificial pairs.
Generative prior-based methods: Recently, many face generation models, such as the popular StyleGAN [58], StyleGAN v2 [106], ProGAN [107], StarGAN [108], and so on, have been proposed, and they are capable of generating faithful faces with a high degree of variability. Thus, more and more researchers explore the generative prior of pretrained GANs.
The first generative prior-based FSR method is PULSE [109]. It formulates FSR as a generation problem: generate a high-quality SR face image such that the downsampled SR result is close to the LR face image. Mathematically, the problem can be expressed as
$$\hat{z} = \arg\min_{z} \big\| G(z)\!\downarrow_s -\, I_{LR} \big\|, \tag{14}$$
where $z$ is a randomly sampled latent vector serving as the input of the pretrained StyleGAN [58], $\downarrow$ is the downsampling operation, $s$ is the downsampling factor, and $G$ denotes the function of the generator. PULSE solves FSR from a new perspective, and this inspires many other works.
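A conceptual PyTorch sketch of the PULSE-style search in Equation (14) follows: a latent code $z$ is optimized so that the downsampled generator output matches the LR input. `generator` stands in for a pretrained StyleGAN (loading one is outside this sketch), the 512-dimensional latent and step counts are illustrative, and PULSE itself additionally constrains $z$ to lie near the latent manifold, which is omitted here.

```python
import torch
import torch.nn.functional as F

def pulse_search(generator, lr_img, scale, steps=1000, step_size=0.1):
    z = torch.randn(1, 512, requires_grad=True)    # randomly sampled latent code z
    opt = torch.optim.Adam([z], lr=step_size)
    for _ in range(steps):
        sr = generator(z)                          # G(z): candidate HR face, (1, 3, H, W)
        down = F.interpolate(sr, scale_factor=1.0 / scale, mode="bicubic")
        loss = F.mse_loss(down, lr_img)            # || G(z) downarrow_s - I_LR ||^2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z).detach()                   # final SR estimate
```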
However, the latent code z in PULSE is randomly sampled and low-dimensional, so the generated images lose important spatial information. To overcome this problem, GLEAN [110], CFP-GAN [111], and GPEN [112] are developed. Rather than directly employing the pretrained StyleGAN [58], they develop their own networks and embed the pretrained StyleGAN generation network into them to incorporate the generative prior. To maintain faithful information, they not only obtain the latent code by encoding LR face images instead of randomly sampling it, but also extract multi-scale features from LR face images and fuse these features into the generation network. In this way, the generative prior provided by the pretrained StyleGAN can be fully utilized and the important spatial information can be well maintained.

4.1.3 Reinforcement Learning-based Methods.

Deep learning-based FSR methods learn the mapping from LR face images to HR ones but ignore the contextual dependencies among facial parts. Cao et al. [113] propose attention-aware face hallucination via deep reinforcement learning (Attention-FH), which recurrently discovers facial parts and enhances them by fully exploiting the global interdependency of the image. Specifically, Attention-FH has two subnetworks: a policy network that locates the region to be enhanced at the current step, and a local enhancement network that enhances the selected region.

4.1.4 Ensemble Learning-based Methods.

CNN-based methods utilize pixelwise loss to recover face images with higher PSNR and smoother details while GAN-based methods can generate face images with lower PSNR but more high-frequency details. To combine the advantages of different types of methods, ensemble learning is used in adaptive threshold-based multi-model fusion network (ATFMN) [114]. Specifically, ATFMN uses three models (CNN based, GAN based, and RNN based) to generate candidate SR faces, and then fuses all candidate SR faces to reconstruct the final SR result. In contrast to previous approaches, ATFMN exploits the potential of ensemble learning for FSR instead of focusing on a single model.

4.1.5 Discussion.

Here we discuss the pros and cons of these subcategories of general FSR methods. From a global perspective, the difference between CNN-based and GAN-based methods lies in adversarial learning. CNN-based methods tend to utilize pixelwise losses, leading to higher PSNR and smoother results, while GAN-based methods can recover visually pleasing face images with more details but lower PSNR. Each of them has its own merits. Compared with them, ensemble learning-based methods can combine their advantages and make up for their deficiencies by integrating multiple models; however, ensemble learning inevitably increases memory, computation, and parameters. Reinforcement learning-based methods recover attentional local regions by sequential search and consider the contextual dependency of patches from a global perspective, which improves performance but requires much more training time and computational cost.

4.2 Prior-guided FSR

General FSR methods aim to design efficient networks. Nevertheless, as a highly structured object, the human face has specific characteristics, such as prior information (including facial landmarks, facial parsing maps, and facial heatmaps), which are ignored by general FSR methods. Therefore, to recover face images with a much clearer facial structure, researchers begin to develop prior-guided FSR methods.
Prior-guided FSR methods extract facial prior information and utilize it to facilitate face reconstruction. Considering the order of prior information extraction and FSR, we further divide prior-guided FSR methods into four parts: (i) pre-prior methods that extract prior information before FSR, (ii) parallel-prior methods that perform prior extraction and FSR simultaneously, (iii) in-prior methods that extract prior information from the intermediate results or features at the middle stage, and (iv) post-prior methods that extract prior information from FSR results. We illustrate the main frameworks of the four categories in Figure 4, outline the development of prior-guided FSR methods in Figure 5, and compare them on several key features in Table 3.
Fig. 4. Four frameworks of prior-guided FSR methods. PEN is the prior estimation network, SRN is the super-resolution network, FEN is a feature extraction network, and P is prior information.
Fig. 5. Milestones of prior-guided FSR methods. We simply list their names and venues.
| | Methods | Prior | Extraction | Fusion Strategies |
|--|---------|-------|------------|-------------------|
| Pre | LCGE [115] | Landmark | Pretrained | Crop |
| | MNCEFH [116] | Landmark | Pretrained | Crop |
| | PSFR-GAN [117] | Parsing map | Pretrained | Concatenation |
| | CAGFace [45] | Parsing map | Pretrained | Concatenation |
| | FSRG3DFP [120] | 3D prior | Joint | SFT |
| | SeRNet [118] | Parsing map | Pretrained | IRB |
| Parallel | CBN [121] | Dense correspondence field | Joint | Concatenation |
| | KPEFH [122] | Parsing map | Joint | – |
| | JASRNet [123] | Heatmap | Joint | – |
| | ATSENet [124] | Facial boundary heatmap | Joint | FFU |
| In | FSRNet [35] | Landmark, parsing map, heatmap | Joint | Concatenation |
| | FSRGFCH [125] | Heatmap | Joint | Concatenation |
| | DIC [36] | Heatmap | Joint | AFM |
| Post | Super-FAN [126] | Heatmap | Joint | – |
| | PFSRNet [127] | Heatmap | Pretrained | – |

Table 3. Comparison of Prior-guided FSR Methods
To be short, we use Pre, Parallel, In, and Post to denote different prior-guided methods.

4.2.1 Pre-prior Methods.

These methods first extract face structure prior information and then feed the prior information to the beginning of the FSR model. That is, they extract prior information from LR face images by an extraction network, which can be a pretrained network or a subnetwork associated with the FSR model, and then take advantage of the prior information to facilitate FSR. To extract an accurate face structure prior, a prior-based loss is usually used in these methods to train the prior extraction network, defined as
$$\mathcal{L}_{prior} = \big\|P - \bar{P}\big\|_F, \tag{15}$$
where $\bar{P}$ is the ground truth prior, $P$ is the prior extracted from the super-resolved face image, $F$ (the order of the norm) can be 1 or 2, and the prior can be heatmaps, landmarks, or parsing maps in different methods.
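In training, the prior-based loss of Equation (15) is typically combined with a reconstruction loss. The sketch below shows one plausible combination; `prior_net` stands in for the prior extraction network, `gt_prior` for the ground truth prior (e.g., heatmaps), and the weight `lambda_p` is an illustrative choice.

```python
def prior_guided_loss(sr, hr, gt_prior, prior_net, lambda_p=0.1):
    """Pixelwise L1 term plus the prior-based term of Equation (15) with F = 2."""
    pixel = (sr - hr).abs().mean()                     # pixelwise reconstruction
    prior = (prior_net(sr) - gt_prior).pow(2).mean()   # extracted vs. ground truth prior
    return pixel + lambda_p * prior
```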
In the early years, both LCGE [115] and MNCEFH [116] extract landmarks from LR face images to crop the faces into different components and then predict high-frequency details for the different components. However, accurate landmarks are unavailable, especially when LR face images are tiny (i.e., 16×16). Thus, researchers turn to facial parsing maps [45, 117, 118, 119]. PSFR-GAN [117], SeRNet [118], and CAGFace [45] all pretrain a face structure prior extraction network to extract facial parsing maps. All of them except SeRNet directly concatenate the prior and the LR face image as the input of the super-resolution model, while SeRNet designs an improved residual block (IRB) to fuse the prior with features from LR face images. In addition, PSFR-GAN designs a semantic-aware style loss to calculate the gram matrix loss for each semantic region separately. Later, super-resolution guided by three-dimensional (3D) facial priors (FSRG3DFP) [120] estimates 3D priors instead of 2D priors to learn 3D facial details and captures facial component information by a spatial feature transform (SFT) block.

4.2.2 Parallel-prior Methods.

The above methods ignore the correlation between face structure prior estimation and FSR task: face prior estimation benefits from the enhancement of FSR and vice versa. Thus, parallel-prior methods that perform prior estimation and super-resolution in parallel are proposed, including cascaded bi-network (CBN) [121], KPEFH [122], JASRNet [123], SAAN [128], HaPFSR [129], OBC-FSR [130], and ATSENet [124]. They train the prior estimation and super-resolution networks jointly and require ground truth prior to calculate prior-based loss like Equation (15).
One of the most representative parallel-prior methods is JASRNet. Specifically, JASRNet utilizes a shared encoder to extract features for super-resolution and prior estimation simultaneously. Through this design, the shared encoder can extract the most expressive information for both tasks. In contrast to JASRNet, ATSENet not only extracts shared features for the two tasks, but also feeds features from the prior estimation branch into the feature fusion unit (FFU) in the super-resolution branch.

4.2.3 In-prior Methods.

Pre- and parallel-prior methods directly extract structure prior information from LR face images. Due to the low quality of LR face images, extracting accurate prior information is challenging. To reduce the difficulty and improve the accuracy of prior estimation, researchers first coarsely recover LR face images and then extract prior information from the enhanced results, as in FSRNet [35], FSR guided by facial component heatmaps (FSRGFCH) [125], HCFR [131], and deep-iterative-collaboration (DIC) [36, 132, 133, 134, 135, 136]. Similar to parallel-prior methods, in-prior methods usually optimize the networks of the two tasks jointly.
Specifically, FSRNet [35], FSRGFCH [125], and HCFR [131] first upsample the LR face images to obtain intermediate results, then extract the face structure prior from the intermediate results, and finally make use of the prior and intermediate results to recover the final results. FSRNet and FSRGFCH concatenate the intermediate results and the prior and feed them into the following network to recover the final SR results, while HCFR utilizes the prior to segment the intermediate results and recovers the final SR results by random forests. Considering that FSR and prior extraction should facilitate each other, DIC [36] proposes to iteratively perform the super-resolution and prior extraction tasks. In the first iteration, DIC recovers a face with the super-resolution model and extracts the prior (heatmaps) from it. In the $i$th iteration, both the LR face image and the prior from the previous iteration are fed into the super-resolution model to obtain a refined result, from which the prior is extracted again. In this way, the two tasks can promote each other. Moreover, DIC builds an attention fusion module (AFM) to fuse the facial prior and the LR face image efficiently.
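The iterative collaboration of DIC can be summarized in a few lines; the module names and the number of iterations below are illustrative, and `sr_net` is assumed to accept a missing prior (None) in the first iteration.

```python
def iterative_collaboration(sr_net, prior_net, lr_img, iters=3):
    prior = None
    for _ in range(iters):
        sr = sr_net(lr_img, prior)   # super-resolve given the LR input and current prior
        prior = prior_net(sr)        # re-estimate heatmaps from the improved SR result
    return sr, prior
```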

4.2.4 Post-prior Methods.

In contrast to the above methods, post-prior methods extract the face structure prior from the SR result rather than the LR face image or an intermediate result, and utilize the prior to design loss functions; they include Super-FAN [126], the progressive FSR network (PFSRNet) [127], and Reference [137]. Super-FAN [126] and PFSRNet [127] first super-resolve LR face images to obtain SR results, then develop a prior estimation network to extract the heatmaps of the SR face images and HR ones, and constrain the two sets of heatmaps to be close. PFSRNet further generates multi-scale super-resolved results and applies the prior-based loss at every scale. In addition, PFSRNet utilizes heatmaps to generate a mask and calculates a facial attention loss based on the masked SR and HR face images. Compared with the above methods, post-prior methods do not require prior extraction during inference.

4.2.5 Discussion.

All prior-guided FSR methods need the ground truth of the face structure prior to calculate the loss in the training phase. During the testing phase, all prior-guided FSR methods except post-prior methods need to estimate the prior. Due to the loss of information caused by image degradation, LR face images increase the difficulty and limit the accuracy of prior extraction in pre-prior methods, further limiting the super-resolution performance. Although parallel-prior methods can facilitate prior extraction and super-resolution simultaneously by sharing feature extraction, the improvement is still limited. In-prior methods extract the prior from intermediate results, which improves performance but increases the memory and computation cost caused by the iterative super-resolution procedure, especially in the iterative method DIC [36]. In post-prior methods, the prior only plays the role of a supervisor during training and does not participate in inference, so they cannot make full use of the specific prior of the input LR face image. Thus, a method that can fully exploit the prior without additional memory or computation cost is still in demand.

4.3 Attribute-constrained FSR

Facial attributes are also commonly exploited in FSR; such methods are called attribute-constrained FSR. As a kind of semantic information, facial attributes provide semantic knowledge, e.g., whether a person wears glasses, which is useful for FSR. In the following, we introduce some attribute-constrained FSR methods.
Different from face structure priors, whose acquisition relies on the image itself, attribute information can be available without LR face images; for example, in criminal cases, attribute information may not be discernible in LR face images but accurately known from witnesses. Thus, some researchers construct networks on the condition that attribute information is given, while others relax this requirement by estimating attributes. Accordingly, attribute-constrained FSR methods can be divided into two frameworks: given-attribute methods and estimated-attribute methods. An overview is provided in Figure 6 and Table 4.
Fig. 6. Milestones of attribute-constrained FSR methods. Their names and venues are listed.
| | Methods | #Attribute | Attribute embedding methods |
|--|---------|-----------|------------------------------|
| Given | FSRSA [139] | 18 | Concatenation and attribute-based loss |
| | EFSRSA [142] | 18 | Concatenation and attribute-based loss |
| | AGCycleGAN [138] | 18 | Concatenation and attribute-based loss |
| | AACNN [141] | 38 | Concatenation |
| | ATNet [140] | NG | Concatenation and attribute-based loss |
| | ATSENet [124] | NG | Concatenation and attribute prediction loss |
| Estimated | RAAN [143] | NG | Attribute channel attention and attribute prediction loss |
| | FACN [144] | 18 | Attribute attention mask and attribute prediction loss |

Table 4. Comparison of Attribute-constrained FSR Methods
“NG” denotes that the information is not given.

4.3.1 Given Attribute Methods.

Given the attribute information, the key is how to integrate it into the super-resolution model. For this problem, attribute-guided conditional CycleGAN (AGCycleGAN) [138], FSR with supplementary attributes (FSRSA) [139], expansive FSR with supplementary attributes (EFSRSA) [142], the attribute transfer network (ATNet) [140], and ATSENet [124] all directly concatenate the attribute information and the LR face image (or features extracted from the LR face image); a sketch of this concatenation strategy is given at the end of this subsection. AGCycleGAN and FSRSA also feed the attributes into their discriminators to force the super-resolution model to notice the attribute information, and develop an attribute-based loss to achieve attribute matching, defined as
$$\mathcal{L}_{attr} = -\mathbb{E}\big[\log D(I_{HR}, A)\big] - \mathbb{E}\big[\log\big(1 - D(I_{SR}, A)\big)\big] - \mathbb{E}\big[\log\big(1 - D(I_{HR}, \bar{A})\big)\big], \tag{16}$$
where $A$ is the attribute vector matched with $I_{HR}$ while $\bar{A}$ is a mismatched one. ATSENet feeds the super-resolved result into an attribute analysis network to calculate an attribute prediction loss,
$$\mathcal{L}_{pred} = \big\|\hat{A} - A\big\|_2^2, \tag{17}$$
where $\hat{A}$ is the attribute predicted by the network and $A$ is the ground truth attribute. However, Lee et al. [141] hold that the LR face image and attributes belong to different domains, so direct concatenation is unsuitable and may decrease performance. In view of this, Lee et al. construct an attribute-augmented convolutional neural network (AACNN) [141], which extracts features from the attributes to boost face super-resolution.
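The concatenation strategy mentioned above can be sketched as follows: the binary attribute vector is spatially replicated and concatenated with the image (or feature) tensor along the channel dimension. This is a generic sketch of the idea, not the exact layer of any particular method.

```python
import torch

def concat_attributes(features, attrs):
    """features: (B, C, H, W); attrs: (B, A) binary vector -> (B, C + A, H, W)."""
    b, a = attrs.shape
    maps = attrs.float().view(b, a, 1, 1).expand(b, a, *features.shape[2:])
    return torch.cat([features, maps], dim=1)  # attribute maps become extra channels
```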

4.3.2 Estimated Attribute Methods.

The above given-attribute methods work on the condition that all attributes are provided, making them limited in real-world scenes where some attributes are missing. Although the missing attributes can be set as unknown values, such as 0 or random values, the performance may drop sharply. To this end, researchers build modules to estimate attribute information for FSR. In estimated-attribute methods, an attribute-based loss similar to Equation (17) forces the network to predict the attribute information correctly. Estimated-attribute methods include the residual attribute attention network (RAAN) [143] and the facial attribute capsule network (FACN) [144]. RAAN is based on cascaded residual attribute attention blocks (RAABs). An RAAB builds three branches to generate shape, texture, and attribute information, respectively, and introduces two attribute channel attentions applied to the shape and texture information. In contrast, FACN [144] integrates attributes in capsules. Specifically, FACN encodes the LR face image into features, which are fed into a capsule generation block that produces semantic capsules, probabilistic capsules, and facial attributes. The attributes are then used as a kind of mask to refine the other features by multiplication or summation. With the combination of the three kinds of information as input, the decoder of FACN can recover the final SR results well.

4.3.3 Discussion.

Given-attribute methods require attribute information, making them applicable only in restricted scenes; although the attributes can be set as unknown, the performance may drop sharply. Estimated-attribute methods need to estimate the attributes before utilizing them. Compared with given-attribute methods, they have a wider range of applications, but the accuracy of attribute estimation is difficult to guarantee in practice.

4.4 Identity-preserving FSR

Compared with face structure priors and attribute information, identity information, which contains identity-aware details, is essential, and identity-preserving FSR methods have received increasing attention in recent years. They aim to maintain identity consistency between the SR face image and the LR one and to improve the performance of downstream face recognition. We show an overview and comparison of some representative methods in Figure 7 and Table 5.
Fig. 7. Milestones of identity-preserving FSR methods. Their names and venues are listed.
| | Methods | Loss Functions |
|--|---------|----------------|
| Face Recognition-based | SICNN [145] | MSE loss on normalized features of $I_{SR}$ and $I_{HR}$ |
| | FH-GAN [146] | MSE loss on features of $I_{SR}$ and $I_{HR}$ |
| | WaSRGAN [147] | $L_1$ loss on features of $I_{SR}$ and $I_{HR}$ |
| | C-SRIP [150] | Cross entropy loss on $R_{SR}$ and $R_{HR}$ |
| | IPFH [149] | A-softmax loss on features of $I_{SR}$ and $I_{HR}$ |
| | SPGAN [99] | Attention-based identity loss |
| Pairwise Data-based | SiGAN [157] | Pair contrastive loss |
| | IADFH [158] | Adversarial face verification loss |

Table 5. Comparison of Identity-preserving FSR Methods
Notably, $R_{SR}$ ($R_{HR}$) is the residual map between $I_{SR}$ ($I_{HR}$) and the bicubically upsampled LR face image.

4.4.1 Face Recognition-based Methods.

To maintain identity consistency between $I_{SR}$ and $I_{HR}$ in the training phase, a commonly used design is to utilize a face recognition network to define an identity loss, as in the super-identity convolutional neural network (SICNN) [145], the face hallucination generative adversarial network (FH-GAN) [146], WaSRGAN [147], Reference [148], identity preserving face hallucination (IPFH) [149], cascaded super-resolution and identity priors (C-SRIP) [150, 151, 152, 153, 154], and ATSENet [124]. The framework of these methods consists of two main components: a super-resolution model and a pretrained face recognition network (FRN), possibly with an additional discriminator. The super-resolution model super-resolves the input LR face image, generating $I_{SR}$, which is fed into the FRN to obtain its identity features; simultaneously, $I_{HR}$ is also fed into the FRN to obtain its identity features. The identity loss is calculated by
$$\mathcal{L}_{id} = \big\|FR(I_{SR}) - FR(I_{HR})\big\|_F, \tag{18}$$
where $FR$ is the function of the FRN and $F$ is 1 in WaSRGAN [147] and 2 in FH-GAN [146, 151]. Some methods calculate the loss on normalized features [145, 155], and some use the A-softmax loss [149, 156]. Rather than directly extracting identity features from $I_{SR}$ and $I_{HR}$, C-SRIP [150] feeds the residual maps between $I_{SR}$ (or $I_{HR}$) and the bicubically upsampled LR face image into the FRN and applies a cross-entropy loss on them. Moreover, C-SRIP generates multi-scale face images that are fed into face recognition networks of corresponding scales.
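A sketch of the identity loss in Equation (18) follows; `frn` stands in for a pretrained, frozen face recognition embedder (e.g., an ArcFace-style network), and the choice between the $L_1$ and $L_2$ variants mirrors the methods cited above.

```python
import torch

def identity_loss(frn, sr, hr, norm=2):
    """Equation (18): distance between identity features of I_SR and I_HR."""
    with torch.no_grad():
        target = frn(hr)              # identity features of I_HR (FRN stays frozen)
    diff = frn(sr) - target
    return diff.abs().mean() if norm == 1 else diff.pow(2).mean()
```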
To fully explore the identity prior, SPGAN [99] feeds identity information extracted by the pretrained FRN to the discriminator at different scales and designs an attention-based identity loss. First, SPGAN computes the difference between the identity features and generates two attention maps $M_1$ and $M_2$:
$$E = FR(I_{HR}) - FR(I_{SR}), \tag{19}$$
$$M_1 = E \odot B, \tag{20}$$
$$M_2 = E \odot (b - B), \tag{21}$$
where $E$ denotes the difference, $\odot$ denotes the element-wise multiplication, $b$ is an identity matrix, and $B$ is a 0-1 matrix: at the $i$th row and $j$th column, $B_{ij}$ is 0 when $E_{ij}$ is negative, otherwise $B_{ij}$ is 1. Then the two attention maps are applied to the identity loss $\mathcal{L}_{id}$,
$$\mathcal{L}_{id}^{SP} = \big\|M_1\big\|_1 + \big\|M_2\big\|_1, \tag{22}$$
where $\mathcal{L}_{id}^{SP}$ is the identity loss of SPGAN.

4.4.2 Pairwise Data-based Methods.

The training of an FRN needs well-labeled datasets; however, a large well-labeled dataset is very costly. One solution relies only on weakly-labeled datasets. In consideration of this, the siamese generative adversarial network (SiGAN) [157] takes advantage of the weak pairwise label (in which different LR face images correspond to different identities) to achieve identity preservation. Specifically, SiGAN has twin GANs ($G_1$ and $G_2$) that share the same architecture but super-resolve different LR face images ($I_{LR}^1$ and $I_{LR}^2$) at the same time. As the identities of different LR face images are different, the identities of the SR results corresponding to the LR face images also differ. Based on this observation, SiGAN designs an identity-preserving contrastive loss that minimizes the feature distance of same-identity pairs and maximizes that of different-identity pairs,
$$\mathcal{L}_{pair} = y\, d^2 + (1 - y)\max(0,\, m - d)^2, \tag{23}$$
$$d = \big\|f(I_{SR}^1) - f(I_{SR}^2)\big\|_2, \tag{24}$$
where $f$ is a function used to extract features from the intermediate layers of the generators, $d$ measures the distance between the features of $I_{SR}^1$ and $I_{SR}^2$, $m$ is a margin, $y$ is 1 when the two LR face images belong to the same identity, and $y$ is 0 when they belong to different identities.
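A direct sketch of Equations (23) and (24); `f1` and `f2` are intermediate generator features for the two images in a pair, and the margin value is illustrative.

```python
import torch

def pair_contrastive_loss(f1, f2, y, margin=1.0):
    """f1, f2: (B, D) features; y: (B,) with 1 for same identity, 0 otherwise."""
    d = torch.norm(f1 - f2, p=2, dim=1)                      # Equation (24)
    same = y * d.pow(2)                                      # pull same-identity pairs
    diff = (1 - y) * torch.clamp(margin - d, min=0).pow(2)   # push different ones apart
    return (same + diff).mean()                              # Equation (23)
```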
Instead of feeding pair data into twin generators, identity-aware deep face hallucination (IADFH) [158] feeds pair data into the discriminator. Its discriminator is a three-way classifier over fake, genuine, and imposter pairs: (i) an HR and an SR face image with the same or different identities $(I_{HR}, I_{SR})$ correspond to fake, which forces the discriminator to distinguish $I_{HR}$ and $I_{SR}$; (ii) two different HR face images of the same identity $(I_{HR}^1, I_{HR}^2)$ correspond to genuine; and (iii) two HR face images with different identities $(I_{HR}^1, I_{HR}^3)$ correspond to imposter. The last two pair types force the discriminator to capture identity features, and in this pattern the generator can incorporate the identity information. The loss is called the adversarial face verification loss (AFVL):
$$\mathcal{L}_{D} = \mathbb{E}\big[(D(I_{HR}, I_{SR}) - c_f)^2\big] + \mathbb{E}\big[(D(I_{HR}^1, I_{HR}^2) - c_g)^2\big] + \mathbb{E}\big[(D(I_{HR}^1, I_{HR}^3) - c_i)^2\big], \tag{25}$$
$$\mathcal{L}_{G} = \mathbb{E}\big[(D(I_{HR}, I_{SR}) - c_g)^2\big], \tag{26}$$
$$\mathcal{L}_{AFVL} = \mathcal{L}_{D} + \mathcal{L}_{G}, \tag{27}$$
where $\mathcal{L}_{D}$ ($\mathcal{L}_{G}$) is the loss function of the discriminator (generator), and $c_f$, $c_g$, and $c_i$ (which can be $-1$, 1, and 0) are the target outputs of the discriminator for fake, genuine, and imposter pairs.

4.4.3 Discussion.

Face recognition-based methods design an identity loss based on a face recognition network that is usually pretrained; training such a network requires well-labeled datasets, which are costly. Instead, pairwise data-based methods take advantage of the contrast between different identities and the similarity within the same identity to maintain identity consistency without well-labeled datasets, which gives them a wider range of applications.

4.5 Reference FSR

The FSR methods discussed so far exploit only the input LR face itself. In some conditions, we may obtain a high-quality face image of the same identity as the LR face image; for example, the person in the LR face image may have other high-quality face images. These high-quality face images can provide identity-aware face details for FSR. Thus, reference FSR methods utilize high-quality face image(s) as a reference (R) to boost face restoration. The reference can be a single image or multiple images. According to the number of references, reference FSR methods can be partitioned into single-face guided, multi-face guided, and dictionary-guided methods. An overview of reference FSR methods is shown in Figure 8 and a comparison of them in Table 6.
Fig. 8. Milestones of reference FSR methods. We simply list their names and venues.
| | Methods | Same identity | Alignment | Utilization of R |
|--|---------|---------------|-----------|-------------------|
| Single-face guided | GFRNet [39] | ✓ | Landmark | Concatenation |
| | GWAInet [159] | ✓ | Flow field | GFENet |
| Multi-face guided | ASFFNet [37] | ✓ | Moving least-square | AFFB |
| | MEFSR [160] | ✓ | – | PWAve |
| Dictionary-guided | JSRFC [161] | × | Landmark | Concatenation |
| | DFDNet [38] | × | – | DFT |

Table 6. Comparison of Reference FSR Methods
“–” denotes that the method does not contain the procedure.

4.5.1 Single-face Guided Methods.

At first, a high-quality face image that shares the same identity with the LR face image serves as R, as in the guided face restoration network (GFRNet) [39] and GWAInet [159]. Since the reference face image and the LR face image may have different poses and expressions, which may hinder the recovery of face images, single-face guided methods tend to perform an alignment between the reference face image and the LR face image. After alignment, both the LR face image and the aligned reference face image (denoted $R_a$) are fed into a reconstruction network to recover the SR result. The differences between GFRNet and GWAInet lie in two aspects: (i) GFRNet employs landmarks while GWAInet employs a flow field to carry out the alignment, and (ii) in the reconstruction network, GFRNet directly concatenates the LR face image and $R_a$ as the input, whereas GWAInet builds a GFENet to extract features from $R_a$ and transfer the useful features of $R_a$ to the reconstruction network to recover SR results.

4.5.2 Multi-face Guided Methods.

Single-face guided methods assume that an LR face image has only one high-quality reference face image, but in some applications many high-quality face images are available, and they can provide more complementary information for FSR. The adaptive spatial feature fusion network (ASFFNet) [37] is the first to explore multi-face guided FSR. Given multiple reference images, ASFFNet first selects the best reference image, which should have the most similar pose and expression to the LR face image, by a guidance selection module. However, misalignment and illumination differences still exist between the reference face image and the LR face image. Thus, ASFFNet applies weighted least-square alignment [162] and AdaIN [163] to cope with these two problems. Finally, it designs an adaptive feature fusion block (AFFB) to generate an attention mask that is used to combine the information from the LR face image and R. Multiple exemplar FSR (MEFSR) [160] directly feeds all reference faces into a weighted pixel average (PWAve) module to extract information for face restoration.

4.5.3 Dictionary-guided Methods.

It is observed that different people may have similar facial components. Based on this observation, dictionary-guided methods have been proposed, including joint super-resolution and face composite (JSRFC) [161] and the deep face dictionary network (DFDNet) [38]. Dictionary-guided methods do not require identity consistency between the reference face image and the LR face image, but instead build a component dictionary to boost face restoration. For example, JSRFC selects reference images that have similar components to the LR face image (every reference face image is labeled with a vector indicating which components are similar). It then aligns the LR face image with the reference face image and extracts the corresponding components to form a component dictionary, whose entries are used for the subsequent face restoration. Different from JSRFC, Li et al. [38] build multi-scale component dictionaries from the features of the entire dataset. They use a pretrained VGGFace [67] to extract features at different scales from high-quality faces, crop and resample four components with landmarks, and then cluster each component into K classes by k-means. Given the component dictionaries, they first select the most similar atom for every component by inner product, and then transfer the features from the dictionary to the LR face image by dictionary feature transfer (DFT).
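A simplified, vectorized sketch of the inner-product atom selection might look as follows (PyTorch; DFDNet actually operates on spatial component features, so this flattened-feature version is only illustrative):

```python
import torch
import torch.nn.functional as F

def select_dictionary_atoms(comp_feat, dictionary):
    """Pick the most similar dictionary atom for each component feature
    by normalized inner product.

    comp_feat: (B, C) features of one cropped facial component.
    dictionary: (K, C) clustered high-quality component atoms.
    Returns the selected atoms, shape (B, C).
    """
    q = F.normalize(comp_feat, dim=1)
    d = F.normalize(dictionary, dim=1)
    scores = q @ d.t()              # (B, K) inner-product similarity
    idx = scores.argmax(dim=1)      # best-matching atom per sample
    return dictionary[idx]
```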

4.5.4 Discussion.

Single-face and multi-face guided FSR methods require one or multiple additional high-quality face image(s) with the same identity as the LR face image, which facilitates face restoration but limits their application, since such reference images may not exist. In addition, the alignment between the low-quality LR face image and the high-quality reference face image is also challenging in reference FSR. Dictionary-guided methods break the restriction of identity consistency, broadening the application but increasing the difficulty of face reconstruction.

4.6 Experiments and Analysis

To have a clear view of deep learning-based FSR methods, we compare the PSNR, SSIM, and LPIPS performance of state-of-the-art algorithms on commonly used benchmark datasets (including CelebA [55], VGGFace2 [67], and CASIA-WebFace [69]) with upscaling factors ×4, ×8, and ×16. Considering that reference FSR methods differ from the other FSR methods in their inputs, we compare the two groups separately.

4.6.1 Comparison Results of FSR Methods.

We first introduce the experimental settings and analyze the results of FSR methods.
Experimental Setting: For the CelebA [55] dataset, 168,854 images are used for training and 1,000 images for testing, following DIC [36]. All the images are cropped and resized into 128 × 128 to serve as the HR images. We apply the degradation model in Equation (4) to generate the LR images. Facial landmarks are detected with the detectors in References [164, 165, 166], and heatmaps are generated from the landmarks. For the facial parsing map, we adopt a pretrained BiSeNet [167] to extract the parsing map from the HR image. For quality evaluation, PSNR and SSIM are adopted, both computed on the Y channel of YCbCr space, which also follows DIC [36]. In addition, we further introduce LPIPS to evaluate the performance of all comparison approaches. For the optimizer and learning rate when retraining different methods, we follow the settings in their original papers.
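For reference, the Y-channel PSNR used above can be computed as in the following sketch (NumPy; assumes float RGB inputs in [0, 1] and the ITU-R BT.601 luma conversion):

```python
import numpy as np

def rgb_to_y(img):
    """ITU-R BT.601 luma; img is float RGB in [0, 1], shape (H, W, 3)."""
    return 16.0 / 255 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                         + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr, hr):
    """PSNR computed on the Y channel, matching the evaluation protocol."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * np.log10(1.0 / mse)
```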
Experimental Results: We list and compare the results of some representative FSR methods in Table 7, including four general image super-resolution methods: SRCNN [70], VDSR [85], the residual channel attention network (RCAN) [168], and the non-local sparse network (NLSN) [169]; three general FSR methods: URDGN [34], WaSRNet [82], and SPARNet [78]; three prior-guided FSR methods: FSRNet [35], Super-FAN [126], and DIC [36]; two attribute-constrained FSR methods: FSRSA [142] and AACNN [141]; and three identity-preserving FSR methods: SICNN [145], SiGAN [157], and WaSRGAN [147]. In addition, we report the parameters and FLOPs of these methods in the last two columns of Table 7; both are measured on the ×8 models. We also present visual comparisons between a few state-of-the-art algorithms in Figure 9, Figure 10, and Figure 11.
Fig. 9.
Fig. 9. Qualitative comparison of different FSR approaches for ×4 super-resolution reconstruction.
Fig. 10.
Fig. 10. Qualitative comparison of different FSR approaches for ×8 super-resolution reconstruction.
Fig. 11.
Fig. 11. Qualitative comparison of different FSR approaches for ×16 super-resolution reconstruction.
Table 7.
| Methods | PSNR↑ (×4) | SSIM↑ (×4) | LPIPS↓ (×4) | PSNR↑ (×8) | SSIM↑ (×8) | LPIPS↓ (×8) | PSNR↑ (×16) | SSIM↑ (×16) | LPIPS↓ (×16) | Params | FLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| General Image Super-Resolution Methods | | | | | | | | | | | |
| SRCNN [70] | 28.04 | 0.837 | 0.160 | 23.93 | 0.635 | 0.256 | 20.54 | 0.467 | 0.291 | 0.01M | 0.3G |
| VDSR [85] | 31.25 | 0.906 | 0.055 | 26.36 | 0.761 | 0.112 | 22.42 | 0.594 | 0.186 | 0.6M | 11.0G |
| RCAN [168] | 31.69 | 0.913 | 0.051 | 27.30 | 0.799 | 0.100 | 23.32 | 0.641 | 0.204 | 15.0M | 4.7G |
| RCAN* [168] | 26.30 | 0.769 | 0.177 | 22.17 | 0.521 | 0.265 | — | — | — | 15.0M | 4.7G |
| NLSN [169] | 32.08 | 0.919 | 0.044 | 27.45 | 0.804 | 0.091 | 23.69 | 0.671 | 0.154 | 43.4M | 22.9G |
| NLSN* [169] | 30.82 | 0.899 | 0.065 | — | — | — | — | — | — | 43.4M | 22.9G |
| General FSR Methods | | | | | | | | | | | |
| URDGN [34] | 30.11 | 0.884 | 0.075 | 25.62 | 0.726 | 0.148 | 22.29 | 0.579 | 0.185 | 1.0M | 14.6G |
| WaSRNet [82] | 30.92 | 0.908 | 0.051 | 26.83 | 0.787 | 0.089 | 23.13 | 0.634 | 0.160 | 71.5M | 19.2G |
| SPARNet [78] | 31.71 | 0.913 | 0.048 | 27.44 | 0.804 | 0.089 | 23.68 | 0.674 | 0.139 | 10.0M | 7.2G |
| Prior-guided FSR Methods | | | | | | | | | | | |
| FSRNet [35] | 31.46 | 0.908 | 0.052 | 26.66 | 0.771 | 0.110 | 23.04 | 0.629 | 0.175 | 3.1M | 39.0G |
| Super-FAN [126] | 31.17 | 0.905 | 0.040 | 27.08 | 0.788 | 0.058 | 23.42 | 0.652 | 0.125 | 1.3M | 1.1G |
| DIC [36] | 31.44 | 0.909 | 0.053 | 27.41 | 0.802 | 0.092 | 23.47 | 0.657 | 0.160 | 20.8M | 14.8G |
| Attribute-constrained FSR Methods | | | | | | | | | | | |
| FSRSA [142] | 30.80 | 0.898 | 0.058 | 26.19 | 0.757 | 0.111 | 22.84 | 0.630 | 0.153 | 76.9M | 0.9G |
| AACNN [141] | 31.30 | 0.907 | 0.052 | 26.68 | 0.773 | 0.100 | 22.98 | 0.626 | 0.171 | 3.3M | 0.2G |
| Identity-preserving FSR Methods | | | | | | | | | | | |
| SICNN [145] | 31.59 | 0.911 | 0.050 | 27.18 | 0.793 | 0.095 | 23.50 | 0.662 | 0.152 | 4.9M | 5.4G |
| SiGAN [157] | 30.68 | 0.892 | 0.034 | 25.63 | 0.740 | 0.062 | 22.18 | 0.596 | 0.099 | 19.5M | 5.7G |
| WaSRGAN [147] | 30.72 | 0.907 | 0.045 | 25.55 | 0.765 | 0.092 | 22.78 | 0.625 | 0.148 | 71.5M | 19.2G |
Table 7. Quantitative Evaluation of Various FSR Methods on CelebA in Terms of PSNR, SSIM, and LPIPS for ×4, ×8, and ×16
The best, the second-best, and the third-best results are emphasized with red, blue, and underscore, respectively. Note that Params and FLOPs are calculated for the ×8 super-resolution model.
From these objective metrics and visual comparison results, we have the following observations:
(i) The retrained state-of-the-art general image super-resolution methods, such as RCAN and NLSN, are very competitive and even outperform the best FSR methods in terms of PSNR and SSIM. Meanwhile, as a general FSR method, SPARNet obtains the best performance among all the FSR methods. RCAN, NLSN, and SPARNet do not explicitly incorporate any prior knowledge of face images, yet they obtain outstanding results. This shows that the design and optimization of the network is very important, and a well-designed network has stronger fitting capability (smaller reconstruction errors). This observation suggests that a new FSR deep network should be built on a strong backbone network.
(ii) The entries RCAN* and NLSN* in Table 7 denote the models pretrained on general training images, which we downloaded directly from the authors’ pages. The pretrained results under certain magnification factors are not given (indicated as “—” in the table), because these models were not trained under those factors. RCAN and NLSN achieve better performance than RCAN* and NLSN*. This demonstrates that models trained on general images are not suitable for FSR, whereas general image super-resolution methods retrained on face images may perform well (sometimes even better than FSR methods on face images). Therefore, to assess a newly proposed general image super-resolution method on the task of FSR, one cannot directly use the pretrained model released by the authors, but should retrain the model on a face image dataset. It should also be noted that the objective results of the GAN-based FSR methods (e.g., URDGN, FSRSA, SiGAN, and WaSRGAN) are worse than those of NLSN*. This is mainly because they often cannot achieve a low MSE due to the introduction of adversarial losses, which tend to let the models obtain perceptually better SR results at the cost of larger reconstruction errors.
(iii) Compared with general image super-resolution methods and general FSR methods, the methods that incorporate facial characteristics do not perform well in terms of PSNR and SSIM. Nevertheless, we cannot conclude that it is meaningless to develop FSR methods that use facial characteristics. This is mainly because PSNR and SSIM may not be good assessment metrics for image super-resolution [41], let alone for FSR, where human perception is more important. To further examine the super-resolution reconstruction capacity, we also adopt LPIPS, an assessment metric more in line with human judgement. From the LPIPS results, we learn that methods with low PSNR and SSIM may still achieve very good LPIPS scores; see Super-FAN and SiGAN. This indicates that methods introducing facial characteristics can represent the face image well and recover the face contours and discriminative details.
(iv) When we compare FSR methods that use different facial characteristics, such as face structure priors, attributes, and identity, it is difficult to say which type of characteristic is more effective for FSR, because these methods often use different backbone networks, and it is hard to determine whether their performance differences stem from the backbone itself or from the introduced facial characteristics. In practice, we suggest first developing a strong backbone and then incorporating facial characteristics to boost FSR.

4.6.2 Comparison Results of Reference FSR Methods.

The above FSR methods only require LR face images as input, while reference FSR methods require both LR face images and reference images. It would be unfair to compare them directly with methods that do not use auxiliary high-quality face images. Therefore, we evaluate the reference FSR methods separately.
Experimental Setting: Following ASFFNet [37], VGGFace2 [67] is reorganized into 106,000 groups, each containing 3–10 high-quality face images of the same identity; 10,000 groups are used as the training set, 4,000 groups as the validation set, and the remaining groups as the testing set. In addition, two testing sets based on CelebA [55] and CASIA-WebFace [69] are also used, each containing 2,000 groups with 3–10 high-quality face images. We utilize facial landmarks to crop and resize all images into 256 × 256 as the high-quality face images. To generate the LR images, the degradation model in Equation (5), where J and ↓ are embodied as JPEG compression with quality factor q and bicubic downsampling, respectively, is applied to the high-quality images. We consider two types of blur kernels, i.e., Gaussian blur and motion blur kernels, and randomly sample the scale s from {1:0.1:8}, the noise level from {0:1:15}, and the compression quality factor q from {10:1:60} [37]. PSNR, SSIM, and LPIPS [41] are used as metrics.
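A rough sketch of this degradation pipeline is given below (OpenCV/NumPy; the Gaussian kernel size and blur sigma range are assumptions, and the motion blur branch is omitted for brevity):

```python
import cv2
import numpy as np

def degrade(hr, rng=np.random):
    """Sketch of the synthetic degradation above: blur -> downsample ->
    noise -> JPEG, with parameters sampled from the stated ranges.
    hr: uint8 BGR image, e.g., 256x256. Returns the degraded LR image."""
    s = rng.choice(np.arange(1.0, 8.1, 0.1))     # downsampling scale factor
    sigma_n = rng.choice(np.arange(0, 16))       # noise level
    q = rng.choice(np.arange(10, 61))            # JPEG quality factor
    img = cv2.GaussianBlur(hr, (7, 7), sigmaX=rng.uniform(0.5, 3.0))
    h, w = hr.shape[:2]
    img = cv2.resize(img, (int(w / s), int(h / s)), interpolation=cv2.INTER_CUBIC)
    img = np.clip(img + rng.normal(0, sigma_n, img.shape), 0, 255).astype(np.uint8)
    ok, buf = cv2.imencode('.jpg', img, [cv2.IMWRITE_JPEG_QUALITY, int(q)])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```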
Experimental Results: The experimental results are shown in Table 8. Specifically, we list the results of GFRNet [39], GWAInet [159], and the more recent ASFFNet [37] on CelebA [55], VGGFace2 [67], and CASIA-WebFace [69] with upscaling factor ×8. All the results are copied from Reference [37], since we had difficulty reproducing these methods. Note that GFRNet and GWAInet are single-face guided methods while ASFFNet is a multi-face guided method. For fairness, the reference image given to GFRNet and GWAInet is the same as the image selected by ASFFNet. From Table 8, it is obvious that the multi-face guided method ASFFNet performs better than the single-face guided methods (GWAInet and GFRNet). ASFFNet accounts for the illumination difference between the reference face image and the LR face image, which is ignored by GFRNet and GWAInet, and builds a well-designed AFFB instead of simple concatenation to adaptively fuse the features of the reference face image and the LR face image. These two points contribute to the excellent performance of ASFFNet. Thus, eliminating the differences (i.e., misalignment, illumination difference, and so on) between the reference face image and the LR face image and fusing their information effectively are both important in reference FSR methods.
Table 8.
| Methods | PSNR↑ (CelebA [55]) | SSIM↑ | LPIPS↓ | PSNR↑ (VGGFace2 [67]) | SSIM↑ | LPIPS↓ | PSNR↑ (CASIA-WebFace [69]) | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|
| GFRNet [39] | 25.93 | 0.901 | 0.227 | 23.85 | 0.879 | 0.263 | 27.19 | 0.912 | 0.307 |
| GWAInet [159] | 25.77 | 0.901 | 0.210 | 23.87 | 0.879 | 0.261 | 27.18 | 0.910 | 0.250 |
| ASFFNet [37] | 26.39 | 0.905 | 0.185 | 24.34 | 0.881 | 0.238 | 27.69 | 0.921 | 0.219 |
Table 8. Quantitative Evaluation of Various Reference FSR Methods on CelebA [55], VGGFace2 [67], and CASIA-WebFace [69] in Terms of PSNR, SSIM, and LPIPS for ×8
The best, the second-best, and the third-best results are emphasized with red, blue, and underscore, respectively.

4.7 Joint FSR and Other Tasks

Although the above FSR methods have achieved breakthroughs, FSR remains challenging, since the input face images are often affected by many other factors, including shadow, occlusion, blur, abnormal illumination, and so on. To recover such face images efficiently, some works consider the degradation caused by low resolution and these other factors together. Moreover, researchers also jointly perform FSR and other tasks. In the following, we review these joint methods.

4.7.1 Joint Face Completion and Super-resolution.

Low resolution and occlusion or shadowing often coexist in real-world face images, so restoring faces degraded by both factors is important. The simplest way is to first complete the occluded part and then super-resolve the completed LR face image [170]. However, the results often contain large artifacts due to the accumulation of errors. Cai et al. [171] propose FCSR-GAN, which pretrains a face completion model (FCM), combines the FCM with a super-resolution model (SRM), trains the SRM with the FCM fixed, and finally finetunes the whole network. Liu et al. [172] then propose graph convolution pyramid blocks, which require only a single training step rather than the multiple stages of FCSR-GAN. In contrast, Pro-UIGAN [173] utilizes facial landmarks to capture facial geometric priors and recovers occluded LR face images progressively.
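The staged strategy of FCSR-GAN can be sketched as follows (PyTorch; the loss, optimizers, step counts, and the assumption that loader is an iterator yielding (occluded LR, complete LR, HR) triples are all illustrative):

```python
import torch

def train_fcsrgan_style(fcm, srm, loader, steps=(1000, 1000, 1000)):
    """Hedged sketch of the staged strategy described above:
    (1) pretrain the face completion model (FCM),
    (2) train the SR model (SRM) with the FCM frozen,
    (3) finetune both jointly."""
    l1 = torch.nn.L1Loss()
    # Stage 1: pretrain FCM on occluded -> complete LR pairs.
    opt = torch.optim.Adam(fcm.parameters(), lr=1e-4)
    for _ in range(steps[0]):
        occ_lr, lr, hr = next(loader)
        loss = l1(fcm(occ_lr), lr)
        opt.zero_grad(); loss.backward(); opt.step()
    # Stage 2: train SRM with the FCM fixed.
    fcm.requires_grad_(False)
    opt = torch.optim.Adam(srm.parameters(), lr=1e-4)
    for _ in range(steps[1]):
        occ_lr, lr, hr = next(loader)
        loss = l1(srm(fcm(occ_lr)), hr)
        opt.zero_grad(); loss.backward(); opt.step()
    # Stage 3: finetune the whole pipeline end to end.
    fcm.requires_grad_(True)
    opt = torch.optim.Adam(list(fcm.parameters()) + list(srm.parameters()), lr=1e-5)
    for _ in range(steps[2]):
        occ_lr, lr, hr = next(loader)
        loss = l1(srm(fcm(occ_lr)), hr)
        opt.zero_grad(); loss.backward(); opt.step()
```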

4.7.2 Joint Face Deblurring and Super-resolution.

Blurry LR face images often arise in surveillance and sports videos and cannot be recovered effectively by a single-task model, e.g., a super-resolution or deblurring model. In the literature, Yu et al. [174] develop SCGAN to deblur and super-resolve the input jointly. Song et al. [175] then observe that previous methods ignore facial prior information and that the recovered face images lack high-frequency details. Thus, they first utilize a parsing map and the LR face image to recover a basic result, and then feed the basic result into a detail enhancement module to compensate for high-frequency details from a high-quality exemplar. Later, DGFAN [176] develops two task-specific feature extraction modules and imports their features into well-designed gated fusion modules to generate deblurred high-quality results. Xu et al. [177] incorporate a face recognition network into face restoration to improve the identifiability of the recovered face images.

4.7.3 Joint Illumination Compensation and FSR.

Abnormal-illumination FSR has also attracted the attention of many scholars. SeLENet [178] decomposes a face image into a face normal map, an albedo map, and a lighting coefficient, replaces the lighting coefficient with the standard ambient white-light coefficient, and then reconstructs the corresponding neutral-light face image. Ding et al. [179] build a pipeline that first detects faces and then recovers the detected faces with landmark guidance. Zhang et al. [180] utilize an external HR face image with normal illumination to guide abnormally illuminated LR face images for illumination compensation. They develop a copy-and-paste GAN (CPGAN), in which an internal copy-and-paste network exploits internal face information for reconstruction and an external copy-and-paste network compensates the illumination. Based on CPGAN, they further improve the external copy-and-paste network by introducing recursive learning and incorporating landmark estimation, yielding the recursive CPGAN [181]. In contrast, Yasarla et al. [182] introduce network architecture search into face enhancement to design an efficient network and extract identity information from HR guidance to restore face images.

4.7.4 Joint Face Alignment and Super-resolution.

The above FSR methods require all the HR training samples to be aligned. Thus, misalignment between the input LR face image and the training face images often leads to a sharp performance drop and artifacts. Therefore, a set of joint face alignment and super-resolution methods has been developed. Yu et al. [49] insert multiple spatial transformer networks (STNs) [183] into the generator to achieve face alignment, developing TDN and MTDN [184]. As LR face images can be both noisy and unaligned, Yu et al. also build TDAE [185]. TDAE first upsamples and coarsely aligns the LR face image, then downsamples the result to suppress noise, and finally upsamples it again for the final reconstruction.
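The decoder-encoder-decoder idea of TDAE can be sketched as below (PyTorch; the three sub-networks are placeholders for the actual TDAE modules):

```python
import torch.nn as nn

class TDAEStyle(nn.Module):
    """Schematic of the pipeline described above: upsample-and-align,
    downsample to suppress noise, then upsample again."""
    def __init__(self, up1, down, up2):
        super().__init__()
        self.up1, self.down, self.up2 = up1, down, up2

    def forward(self, lr_noisy_unaligned):
        coarse_aligned = self.up1(lr_noisy_unaligned)  # upsampling + coarse alignment
        clean_lr = self.down(coarse_aligned)           # downsampling suppresses noise
        return self.up2(clean_lr)                      # final upsampling
```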

4.7.5 Joint Face Frontalization and Super-resolution.

Faces in the real world have various poses, and many of them are not frontal. When existing FSR methods are applied to non-frontal faces, the reconstruction performance drops sharply and the visual quality is poor; artifacts remain even when FSR and face frontalization are performed in sequence or in reverse order. To alleviate this problem, the method in Reference [186] first takes advantage of an STN and a CNN to coarsely frontalize and hallucinate the faces, and then designs a fine upsampling network to refine the facial details. Yu et al. [187] propose a transformative adversarial neural network for joint face frontalization and hallucination. The method builds a transformer network to encode non-frontal and frontal LR face images into a latent space, requires the non-frontal representation to be close to the frontal one, and then imports the encoded latent representations into an upsampling network to recover the final results. Tu et al. [188] first train a face restoration network and a face frontalization network separately, and then propose a task-integrated training strategy to merge the two networks into a unified network for joint face frontalization and super-resolution. Note that face alignment aims to generate SR face images with the same pose as the HR ones, while face frontalization recovers frontal SR faces from non-frontal LR faces.

4.8 Related Applications

Beyond the above-mentioned FSR methods and joint methods, a number of new FSR-related methods have emerged in recent years, including face video super-resolution, old photo restoration, audio-guided FSR, 3D FSR, and so on, which are introduced in the following.

4.8.1 Face Video Super-resolution.

Faces usually appear in LR video sequences, such as surveillance footage. The correlation between frames can provide complementary details that benefit face reconstruction. One direct solution is to fuse multi-frame information and exploit inter-frame dependency [189]. The approach in Reference [190] employs a generator to produce SR results for every frame and applies a fusion module to estimate the central frame. Considering that such methods cannot model complex temporal dependency, Xin et al. [191] propose a motion-adaptive feedback cell that captures inter-frame motion information and updates the current frames adaptively. The work in Reference [192] assumes that previously super-resolved frames are crucial for reconstructing the subsequent frame, and thus designs a recurrence strategy to make better use of inter-frame information. Inspired by the powerful transformer, the work in Reference [193] develops the first pure transformer-based face video hallucination model. MDVDNet [194] incorporates multiple priors from the video, including speech, semantic elements, and facial landmarks, to enhance the capability of the deep learning-based method.

4.8.2 Old Photo Restoration.

Restoring old photos is vital yet difficult in the real world, since the degradation is too complex to be simulated. One natural solution is to learn the mapping from real LR face images (regarding real old photos as real LR face images) to artificial LR face images, and then apply existing FSR methods to the generated artificial LR face images. BOPBL [195] instead proposes to translate images in latent space rather than image space. Specifically, BOPBL first encodes real and artificial LR face images into a shared latent space and encodes HR face images into another latent space, and then maps the former into the latter with a mapping network.
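The latent-space translation idea can be sketched as follows (PyTorch; the encoder, mapping network, and decoder are placeholders, and the module composition is an illustrative reading of the design, not the actual BOPBL implementation):

```python
import torch.nn as nn

class LatentTranslationSketch(nn.Module):
    """Sketch of the latent-space idea above: a shared encoder maps both
    real and synthetic LR faces into one latent space, a mapping network
    translates that code to the HR latent space, and an HR decoder
    reconstructs the face."""
    def __init__(self, enc_lr, mapping, dec_hr):
        super().__init__()
        self.enc_lr, self.mapping, self.dec_hr = enc_lr, mapping, dec_hr

    def forward(self, lr_face):
        z_lr = self.enc_lr(lr_face)   # shared latent space for real/synthetic LR
        z_hr = self.mapping(z_lr)     # translate between the two latent spaces
        return self.dec_hr(z_hr)
```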

4.8.3 Audio-guided FSR.

Considering that audio carries face-related information [196], Meishvili et al. [197] develop the first audio-guided FSR method. Because of the modality gap, they build two encoders to encode the image and audio information separately. The encoded representations of the image and the audio are then fused, and the fused result is fed into a generator to recover the final SR result. Introducing audio into FSR is novel and inspires researchers to exploit cross-modal information, but remains challenging due to the differences between modalities.
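A sketch of this two-encoder design is given below (PyTorch; the encoders, the generator, and the concatenation-based fusion are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AudioVisualFusionSketch(nn.Module):
    """Schematic of the design described above: separate image and audio
    encoders, with the fused representation fed to a face generator."""
    def __init__(self, img_enc, audio_enc, generator):
        super().__init__()
        self.img_enc, self.audio_enc, self.generator = img_enc, audio_enc, generator

    def forward(self, lr_face, audio):
        # Encode each modality separately, then fuse the latent vectors.
        z = torch.cat([self.img_enc(lr_face), self.audio_enc(audio)], dim=1)
        return self.generator(z)
```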

4.8.4 3D FSR.

The human face is one of the most studied objects in computer vision. With the maturation of 2D techniques, a growing number of 3D methods have been proposed, because they can provide more useful features for face reconstruction and recognition. In the FSR community, an early 3D FSR approach was proposed by Pan et al. [198]. Berretti et al. [199] propose a superface model built from a sequence of low-resolution 3D scans. The approach in Reference [200] takes only a rough, noisy, and low-resolution depth image as input and predicts the corresponding high-quality 3D face mesh. By establishing the correspondence between the input LR face and 3D textures, Qu et al. present a patch-based 3D FSR method on the mesh [201]. Benefiting from the development of deep learning technology, a 3D face point cloud super-resolution network has most recently been developed to infer high-resolution data from low-resolution 3D face point cloud data [202].

5 Conclusion and Future Directions

In this review, we have presented a taxonomy of deep learning-based FSR methods. According to the use of facial characteristics, the field can be divided into five categories: general FSR methods, prior-guided FSR methods, attribute-constrained FSR methods, identity-preserving FSR methods, and reference FSR methods. Each category is further divided into subcategories depending on the design of the network architecture or the specific utilization of facial characteristics. In particular, general FSR methods are further divided into basic CNN-based, GAN-based, reinforcement learning-based, and ensemble learning-based methods, while the other categories are organized by their specific utilization pattern of facial characteristics. We also compare the performance of state-of-the-art methods and provide in-depth analysis. Of course, the FSR technique is not limited to the methods presented here, and a panoramic view of this fast-expanding field is challenging, so omissions are possible. This review therefore serves as a pedagogical tool, providing researchers with insights into typical FSR methods; in practice, researchers can use these general guidelines to develop the most suitable technique for their specific studies.
Despite great breakthroughs, FSR still presents many challenges and is expected to continue its rapid growth. In the following, we provide an outlook on the problems to be solved and the trends to expect in the future.
Design of Network. The comparison with state-of-the-art general image super-resolution methods shows that the backbone network has a crucial impact on performance, especially in terms of PSNR and SSIM. Therefore, we can learn from the general image super-resolution task, in which many well-designed network structures have been continuously proposed (e.g., IPT [203] and SwinIR [204]), and design an effective deep network that is better suited to the FSR task. In addition to effectiveness, efficiency is also needed in practice, since large models (with a mass of parameters and high computation costs) are very difficult to deploy in real-world applications. Hence, developing models with lighter structures and lower computational cost remains a major challenge.
Exploitation of Facial Prior. As a domain-specific super-resolution technique, FSR recovers the facial details that are lost in the observed LR face images. The key to its success is effectively exploiting the prior knowledge of human faces, from 1D vectors (identity and attributes), to 2D images (facial landmarks, heatmaps, and parsing maps), to 3D models. Therefore, discovering new prior knowledge of the human face, modeling or representing this prior knowledge, and integrating it organically into an end-to-end training framework are worthy of further study. Beyond such explicit priors, how to model and utilize implicit priors learned from data (such as the GAN prior [58, 106]) may be another direction.
Metrics and Loss Functions. As we know, the pixelwise L1 or L2 losses tend to produce super-resolution results with high PSNR and SSIM values, while the perceptual loss and adversarial loss favor visually pleasant results, i.e., good performance in terms of LPIPS and FID. The assessment metric thus plays an important role in guiding model optimization and shaping the final results. If we want a trustworthy result (e.g., in criminal investigation applications), then PSNR and SSIM may be the better metrics; in contrast, if we just want visually pleasant results, then LPIPS and FID may be a better choice. As a result, there is no universal assessment metric that makes the best of both worlds, and assessment metrics for FSR need more exploration in the future.
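As an illustration of this trade-off, a typical composite objective can be sketched as follows (PyTorch; vgg_feat and disc are placeholder feature-extractor and discriminator networks, and the weights are common but arbitrary choices):

```python
import torch
import torch.nn.functional as F

def composite_fsr_loss(sr, hr, vgg_feat, disc, w_pix=1.0, w_perc=0.01, w_adv=0.001):
    """Illustrative trade-off: the pixelwise term favors PSNR/SSIM, while
    the perceptual and adversarial terms favor LPIPS/FID-style quality."""
    pixel = F.l1_loss(sr, hr)                              # fidelity (PSNR/SSIM)
    perceptual = F.l1_loss(vgg_feat(sr), vgg_feat(hr))     # perceptual quality
    adversarial = F.softplus(-disc(sr)).mean()             # non-saturating GAN loss
    return w_pix * pixel + w_perc * perceptual + w_adv * adversarial
```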
Discriminative FSR. In most situations, our goal is not only to reconstruct a visually pleasing HR face image; we also hope that the super-resolved results improve face recognition by humans or computers. Therefore, it would be beneficial to recover a discriminative HR face image (for humans) or discriminative features (for computers) from an LR face image. To enhance the discriminability of super-resolved face images, we can use the weakly supervised information (paired positive or negative samples) of the training set to force the model to reconstruct discriminative face images.
Real-world FSR. The degradation process in the real world is too complex to be simulated, which results in a large gap between synthesized LR-HR pairs and real-world data. When models trained on synthesized pairs are applied to real-world LR face images, their performance drops dramatically. Given HR training face images and unpaired real-world LR face images, some methods [102, 205, 206] have been proposed to learn the real image degradation and create sample pairs of synthesized LR and HR face images. These methods achieve better performance than approaches trained with data produced by bicubic degradation. However, they implicitly assume that all real-world LR face images share the same degradation, e.g., captured by the same camera, whereas real-world LR face images are very diverse and their degradation processes differ. Therefore, designing a more robust real-world FSR method is one of the problems that urgently needs to be settled.
Multi-modal FSR. With the rapid development of sensing technology, systems equipped with multiple sensors, such as autonomous driving platforms and robots, are becoming increasingly common, and the utilization of multi-modal information (including audio, depth, and near infrared) will be increasingly promoted. Evidently, different modalities provide different clues. So far, researchers have mostly explored image-related information, such as attributes and identity. Nevertheless, the emergence of audio-guided FSR [197] and hyperspectral FSR [207] inspires us to take advantage of information from different modalities. This trend will undoubtedly continue and diffuse into every category of this field, and the introduction of multi-modal information will further spur the development of FSR.

Footnote

1
A curated list of papers and resources to face super-resolution at https://github.com/junjun-jiang/Face-Hallucination-Benchmark.

References

[1]
Y. Liang, J. H. Lai, W. S. Zheng, and Z. Cai. 2012. A survey of face hallucination. Biometric Recognition. Springer, Berlin, 83–93.
[2]
N. Wang, D. Tao, X. Gao, X. Li, and J. Li. 2013. A comprehensive survey to face hallucination. Int. J. Comput. Vis. 106, 1, 9–30.
[3]
M. P. Autee, M. S. Mehta, M. S. Desai, V. Sawant, and A. Nagare. 2015. A review of various approaches to face hallucination. Proced. Comput. Sci. 45 (2015), 361–369.
[4]
S. Kanakaraj, V. K. Govindan, and S. Kalady. 2017. Face super resolution: A survey. Int. J. Image Graph. Sign. Process. 9, 5 (2017), 54–67.
[5]
K. Nguyen, C. Fookes, S. Sridharan, M. Tistarelli, and M. Nixon. 2018. Super-resolution for biometrics: A comprehensive survey. Pattern Recogn. 78 (2018), 23–42.
[6]
S. S. Rajput, K. V. Arya, V. Singh, and V. K. Bohat. 2018. Face hallucination techniques: A survey. In Proceedings of the Conference on Information and Communication Technology (CICT’18). IEEE, 1–6.
[7]
H. Liu, X. Zheng, J. Han, Y. Chu, and T. Tao. 2019. Survey on GAN-based face hallucination with its model development. IET Image Proc. 13, 14 (2019), 2662–2672.
[8]
S. Baker and T. Kanade. 2000. Hallucinating faces. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 83–88.
[9]
C. Liu, H. Y. Shum, and C. S. Zhang. 2001. A two-step approach to hallucinating faces: Global parametric model and local nonparametric model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’01). IEEE, I-193–I198.
[10]
B. K. Gunturk, A. U. Batur, Y. Altunbasak, M. H. Hayes, and R. M. Mersereau. 2003. Eigenface-domain super-resolution for face recognition. IEEE Trans. Image Process. 12, 5 (2003), 597–606.
[11]
X. Wang and X. Tang. 2005. Hallucinating face by eigentransformation. IEEE Trans. Syst. Man Cybernet. C 35, 3 (2005), 425–434.
[12]
A. Chakrabarti, A. N. Rajagopalan, and R. Chellappa. 2007. Super-resolution of face images using kernel pca-based prior. IEEE Trans. Multimedia 9, 4 (2007), 888–892.
[13]
J. Park and S. Lee. 2008. An example-based face hallucination method for single-frame, low-resolution facial images. IEEE Trans. Image Process. 17, 10 (2008), 1806–1816.
[14]
P. Innerhofer and T. Pock. 2013. A convex approach for image hallucination. AAPRW.
[15]
Y. Liang, X. Xie, and J. H. Lai. 2013. Face hallucination based on morphological component analysis. Sign. Process. 93, 2 (2013), 445–458.
[16]
C. Y. Yang, S. Liu, and M. H. Yang. 2013. Structured face hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1099–1106.
[17]
H. Chang, D. Y. Yeung, and Y. Xiong. 2004. Super-resolution through neighbor embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’04). IEEE, 1–8.
[18]
X. Ma, J. Zhang, and C. Qi. 2010. Hallucinating face by position-patch. Pattern Recogn. 43, 6 (2010), 2224–2236.
[19]
C. Jung, L. Jiao, B. Liu, and M. Gong. 2011. Position-patch based face hallucination using convex optimization. IEEE Sign. Process Lett. 18, 6 (2011), 367–370.
[20]
J. Jiang, R. Hu, Z. Wang, and Z. Han. 2014. Noise robust face hallucination via locality-constrained representation. IEEE Trans. Multimedia 16, 5 (2014), 1268–1281.
[21]
R. A. Farrugia and C. Guillemot. 2017. Face hallucination using linear models of coupled sparse support. IEEE Trans. Image Process. 26, 9 (2017), 4562–4577.
[22]
J. Jiang, Y. Yu, S. Tang, J. Ma, A. Aizawa, and K. Aizawa. 2020. Context-patch face hallucination based on thresholding locality-constrained representation and reproducing learning. IEEE Trans. Cybernet. 50, 1 (2020), 324–337.
[23]
J. Shi, X. Liu, Y. Zong, C. Qi, and G. Zhao. 2018. Hallucinating face image by regularization models in high-resolution feature space. IEEE Trans. Image Process. 27, 6 (2018), 2980–2995.
[24]
L. Chen, J. Pan, and Q. Li. 2019. Robust face image super-resolution via joint learning of subdivided contextual model. IEEE Trans. Image Process. 28, 12 (2019), 5897–5909.
[25]
J. Shi and G. Zhao. 2019. Face hallucination via coarse-to-fine recursive kernel regression structure. IEEE Trans. Multimedia 21, 9 (2019), 2223–2236.
[26]
L. Liu, C. P. Chen, and S. Li. 2020. Hallucinating color face image by learning graph representation in quaternion space. IEEE Trans. Cybern., 1–13.
[27]
L. Chen, J. Pan, J. Jiang, J. Zhang, Z. Han, and L. Bao. 2021. Multi-stage degradation homogenization for super-resolution of face images with extreme degradations. IEEE Trans. Image Process. 30 (2021), 5600–5612.
[28]
Y. Zhuang, J. Zhang, and F. Wu. 2007. Hallucinating faces: Lph super-resolution and neighbor reconstruction for residue compensation. Pattern Recogn. 40, 11 (2007), 3178–3194.
[29]
H. Huang, H. He, X. Fan, and J. Zhang. 2010. Super-resolution of human face image using canonical correlation analysis. Pattern Recogn. 43, 7 (2010), 2532–2543.
[30]
Z. Wang, J. Chen, and S. C. H. Hoi. 2021. Deep learning for image super-resolution: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 43, 10 (2021), 3365–3387.
[31]
S. Anwar, S. Khan, and N. Barnes. 2020. A deep journey into super-resolution. ACM Comput. Surv. 53, 3 (2020), 1–34.
[32]
W. Yang, X. Zhang, Y. Tian, W. Wang, J. H. Xue, and Q. Liao. 2019. Deep learning for single image super-resolution: A brief review. IEEE Trans. Multimedia 21, 12 (2019), 3106–3121.
[33]
H. Liu, Z. Ruan, P. Zhao, F. Shang, L. Yang, and Y. Liu. Video super resolution based on deep learning: A comprehensive survey. arXiv:2007.12928. Retrieved from https://arxiv.org/abs/2007.12928.
[34]
X. Yu and F. Porikli. 2016. Ultra-resolving face images by discriminative generative networks. In Proceedings of the European Conference on Computer Vision (ECCV’16). Springer International Publishing, 318–333.
[35]
Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang. 2018. FSRNet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18). IEEE, 2492–2501.
[36]
C. Ma, Z. Jiang, Y. Rao, J. Lu, and J. Zhou. 2020. Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 5569–5578.
[37]
X. Li, W. Li, D. Ren, H. Zhang, M. Wang, and W. Zuo. 2020. Enhanced blind face restoration with multi-exemplar images and adaptive spatial feature fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 2706–2715.
[38]
X. Li, C. Chen, S. Zhou, X. Lin, W. Zuo, and L. Zhang. 2020. Blind face restoration via deep multi-scale component dictionaries. In Proceedings of the European Conference on Computer Vision (ECCV’20). Springer International Publishing, 399–415.
[39]
X. Li, M. Liu, Y. Ye, W. Zuo, L. Lin, and R. Yang. 2018. Learning warped guidance for blind face restoration. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer International Publishing, 278–296.
[40]
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the Conference on Neural Information Processing Systems (NIPS’17). 6626–6637.
[41]
R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18). IEEE, 586–595.
[42]
W. Zhou and A. C. Bovik. 2002. A universal image quality index. IEEE Sign. Process. Lett. 9, 3 (2002), 81–84.
[43]
Z. Wang, E. P. Simoncelli, and A. C. Bovik. 2003. Multiscale structural similarity for image quality assessment. In Proceedings of the Asilomar Conference on Signals, Systems & Computers. IEEE, 1398–1402.
[44]
A. Mittal, R. Soundararajan, and A. C. Bovik. 2013. Making a “completely blind” image quality analyzer. IEEE Sign. Process Lett. 20, 3 (2013), 209–212.
[45]
R. Kalarot, T. Li, and F. Porikli. 2020. Component attention guided face super-resolution network: CAGFace. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV’20). IEEE, 359–369.
[46]
W. S. Lai, J. B. Huang, N. Ahuja, and M. H. Yang. 2017. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). IEEE, 624–632.
[47]
K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15).
[48]
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, X. Bing, and Y. Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
[49]
X. Yu and F. Porikli. 2017. Hallucinating very low-resolution unaligned and noisy face images by transformative discriminative autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). IEEE, 4327–4333.
[50]
M. Arjovsky, S. Chintala, and L. Bottou. 2017. Wasserstein generative adversarial networks. In Proceedings of the International conference on machine learning (ICML’17). PMLR, 214–223.
[51]
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. 2017. Improved training of wasserstein GANs. In Proceedings of the Conference on Neural Information Processing Systems (NIPS’17). 5767–5777.
[52]
J. Y. Zhu, T. Park, P. P. Isola, and A. A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). IEEE, 2223–2232.
[53]
L. A. Gatys, A. S. Ecker, and M. Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). IEEE, 2414–2423.
[54]
T. C. Wang, M. Y. Liu, J. Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. 2018. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 8798–8807.
[55]
Z. Liu, P. Luo, X. Wang, and X. Tang. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’15). IEEE, 3730–3738.
[56]
V. Le, J. Brandt, L. Zhe, L. D. Bourdev, and T. S. Huang. 2012. Interactive facial feature localization. In Proceedings of the European Conference on Computer Vision (ECCV’12). Springer, Berlin, 679–692.
[57]
C. H. Lee, Z. Liu, L. Wu, and P. Luo. 2020. MaskGAN: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 5549–5558.
[58]
T. Karras, S. Laine, and T. Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). IEEE, 4396–4405.
[59]
M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. 2011. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW’11). IEEE, 2144–2151.
[60]
C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 2013. A semi-automatic methodology for facial landmark annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’13). IEEE, 896–903.
[61]
A. Bulat and G. Tzimiropoulos. 2017. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). IEEE, 1021–1030.
[62]
S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen. 2017. The menpo facial landmark localisation challenge: A step towards the solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’17). IEEE, 170–179.
[63]
G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report 07-49, University of Massachusetts, Amherst.
[64]
L. Wolf, T. Hassner, and Y. Taigman. 2011. Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE Trans. Pattern Anal. Mach. Intell. 33, 10 (2011), 1978–1990.
[65]
O. M. Parkhi, A. Vedaldi, and A. Zisserman. 2015. Deep face recognition. In Procedings of the British Machine Vision Conference (BMVC’15). British Machine Vision Association, 1–12.
[66]
B. Chen, C. Chen, and W. H. Hsu. 2015. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Trans. Multimedia 17, 6 (2015), 804–815.
[67]
Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. 2018. VGGFace2: A dataset for recognising faces across pose and age. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG’18). IEEE, 67–74.
[68]
A. Bansal, A. Nanduri, C. Castillo, R. Ranjan, and R. Chellappa. 2017. UMDFaces: An annotated face dataset for training deep networks. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB’17). IEEE.
[69]
Y. Dong, L. Zhen, S. Liao, and S. Z. Li. 2014. Learning Face Representation from Scratch. arXiv:1411.7923. Retrieved from https://arxiv.org/abs/1411.7923.
[70]
C. Dong, C. C. Loy, K. He, and X. Tang. 2016. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38, 2 (2016), 295–307.
[71]
E. Zhou, H. Fan, Z. Cao, Y. Jiang, and Q. Yin. 2015. Learning face hallucination in the wild. In Proceedings of the AAAI Annual Conference on Artificial Intelligence (AAAI’15). 3871–3877.
[72]
W. Huang, Y. Chen, M. Li, and Y. Hui. 2017. Super-resolution reconstruction of face image based on convolution network. In Advances in Intelligent Systems and Computing. Springer International Publishing, 288–294.
[73]
D. Huang and H. Liu. 2016. Face hallucination using convolutional neural network with iterative back projection. In Biometric Recognition. Springer International Publishing, 167–175.
[74]
X. Chen, X. Wang, Y. Lu, W. Li, Z. Wang, and Z. Huang. 2020. Rbpnet: An asymptotic residual back-projection network for super-resolution of very low-resolution face image. Neurocomputing 376 (2020), 119–127.
[75]
X. Chen and Y. Wu. 2020. Efficient face super-resolution based on separable convolution projection networks. In Proceedings of the 5th International Conference on Control, Robotics and Cybernetics (CRC’20). IEEE, 92–97.
[76]
Y. Liu, Z. Dong, K. Pang Lim, and N. Ling. 2020. A densely connected face super-resolution network based on attention mechanism. In Proceedings of the 15th IEEE Conference on Industrial Electronics and Applications (ICIEA’20). IEEE, 148–152.
[77]
V. Chudasama, K. Nighania, K. Upla, K. Raja, R. Ramachandra, and C. Busch. 2021. E-comsupresnet: Enhanced face super-resolution through compact network. IEEE Trans. Biom. Behav. Ident. Sci. 3, 2 (2021), 166–179.
[78]
C. Chen, D. Gong, H. Wang, Z. Li, and K. Y. K. Wong. 2021. Learning spatial attention for face super-resolution. IEEE Trans. Image Process. 30 (2021), 1219–1231.
[79]
L. Han, H. Zhen, G. Jin, and D. Xin. 2018. A noise robust face hallucination framework via cascaded model of deep convolutional networks and manifold learning. In Proceedings of the IEEE International Conference on Multimedia (ICME’18). 1–6.
[80]
H. Nie, Y. Lu, and J. Ikram. 2016. Face hallucination via convolution neural network. In Proceedings of the IEEE 28th International Conference on Tools With Artificial Intelligence (ICTAI’16). IEEE, 485–489.
[81]
Z. Chen, J. Lin, T. Zhou, and F. Wu. 2021. Sequential gating ensemble network for noise robust multiscale face restoration. IEEE Trans. Cybern. 51, 1 (2021), 451–461.
[82]
H. Huang, R. He, Z. Sun, and T. Tan. 2017. Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). IEEE, 1689–1697.
[83]
Y. Liu, D. Sun, F. Wang, L. K. Pang, and Y. Lai. 2020. Learning wavelet coefficients for face super-resolution. The Visual Computer 37, 7 (2020), 1613–1622.
[84]
X. Hu, P. Ma, Z. Mai, S. Peng, Z. Yang, and L. Wang. 2019. Face hallucination from low quality images using definition-scalable inference. Pattern Recogn. 94 (2019), 110–121.
[85]
J. Kim, J. K. Lee, and K. M. Lee. 2016. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). IEEE, 1646–1654.
[86]
W. Ko and S. Chien. 2016. Patch-based face hallucination with multitask deep neural network. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’16). IEEE, 1–6.
[87]
Z. Feng, J. Lai, X. Xie, D. Yang, and M. Ling. 2016. Face hallucination by deep traversal network. In Proceedings of the International Association of Pattern Recognition (ICPR’16). 3276–3281.
[88]
T. Lu, H. Wang, Z. Xiong, J. Jiang, Y. Zhang, H. Zhou, and Z. Wang. 2017. Face hallucination using region-based deep convolutional networks. In Proceedings of the International Conference on Image Processing (ICIP’17). 1657–1661.
[89]
O. Tuzel, Y. Taguchi, and J. R. Hershey. 2016. Global-local face upsampling network. arXiv:1603.07235. Retrieved from https://arxiv.org/abs/1603.07235.
[90]
T. Lu, J. Wang, J. Jiang, and Y. Zhang. 2020. Global-local fusion network for face super-resolution. Neurocomputing 387 (2020), 309–320.
[91]
K. Jiang, Z. Wang, P. Yi, T. Lu, J. Jiang, and Z. Xiong. 2020. Dual-path deep fusion network for face image hallucination. IEEE Trans. Neural Netw. Learn. Syst. (2020). DOI:https://doi.org/10.1109/TNNLS.2020.3027849.
[92]
S. Ko and B. R. Dai. 2021. Multi-laplacian GAN with edge enhancement for face super resolution. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR’21). IEEE, 3505–3512.
[93]
L. Yang, P. Wang, Z. Gao, S. Wang, P. Ren, S. Ma, and W. Gao. 2020. Implicit subspace prior learning for dual-blind face restoration. arXiv:2010.05508. Retrieved from https://arxiv.org/abs/2010.05508.
[94]
Y. Luo and K. Huang. 2020. Super-resolving tiny faces with face feature vectors. In Proceedings of the 10th International Conference on Information Science and Technology (ICIST’20). IEEE, 145–152.
[95]
S. D. Indradi, A. Arifianto, and K. N. Ramadhani. 2019. Face image super-resolution using inception residual network and GAN framework. In Proceedings of the 7th International Conference on Information and Communication Technology (ICoICT’19). IEEE, 1–6.
[96]
Z. Chen and Y. Tong. 2017. Face super-resolution through wasserstein GANs. arXiv:1705.02438. Retrieved from https://arxiv.org/abs/1705.02438.
[97]
B. Huang, W. Chen, X. Wu, and C. L. Lin. 2018. High-quality face image generated with conditional boundary equilibrium generative adversarial networks. Pattern Recogn. Lett. 111, (2018), 72–79.
[98]
H. Dou, C. Chen, X. Hu, Z. Xuan, Z. Hu, and S. Peng. 2020. PCA-SRGAN: Incremental orthogonal projection discrimination for face super-resolution. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM’20). ACM, 1891–1899.
[99]
M. Zhang and Q. Ling. 2020. Supervised Pixel-wise GAN for Face Super-resolution. IEEE Trans. Multimedia 23 (2020), 1938–1950.
[100]
K. Grm, M. Pernus, L. Cluzel, W. Scheirer, S. Dobrisek, and V. Struc. 2019. Face hallucination revisited: An exploratory study on dataset bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’19). IEEE, 2405–2413.
[101]
A. Aakerberg, K. Nasrollahi, and T. B. Moeslund. 2021. Real-world super-resolution of face-images from surveillance cameras. arXiv:2102.03113. Retrieved from https://arxiv.org/abs/2102.03113.
[102]
A. Bulat, Y. Jing, and G. Tzimiropoulos. 2018. To learn image super-resolution, use a GAN to learn how to do image degradation first. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer International Publishing, 187–202.
[103]
S. Goswami, Aakanksha, and A. N. Rajagopalan. 2020. Robust super-resolution of real faces using smooth features. In Proceedings of the European Conference on Computer Vision Workshops (ECCVW.18). Springer International Publishing, 169–185.
[104]
W. Zheng, L. Yan, W. Zhang, C. Gou, and F. Wang. 2019. Guided cyclegan via semi-dual optimal transport for photo-realistic face super-resolution. In Proceedings of the IEEE International Conference on Image Processing (ICIP’19). IEEE, 2851–2855.
[105]
Z. Cheng, X. Zhu, and S. Gong. 2020. Characteristic regularisation for super-resolving face images. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV’20). IEEE, 2424–2433.
[106]
T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 8107–8116.
[107]
T. Karras, T. Aila, S. Laine, and J. Lehtinen. 2020. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE.
[108]
Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’18). 8789–8797.
[109]
S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin. 2020. PULSE: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 2223–2232.
[110]
K. C. K. Chan, X. Wang, X. Xu, J. Gu, and C. C. Loy. 2021. GLEAN: Generative latent bank for large-factor image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 14245–14254.
[111]
X. Wang, Y. Li, H. Zhang, and Y. Shan. 2021. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 9168–9178.
[112]
T. Yang, P. Ren, X. Xie, and L. Zhang. 2021. GAN prior embedded network for blind face restoration in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 672–681.
[113]
Y. Shi, G. Li, Q. Cao, K. Wang, and L. Lin. 2020. Face hallucination by attentive sequence optimization with reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell. 42, 11 (2020), 2809–2824.
[114]
K. Jiang, Z. Wang, P. Yi, G. Wang, K. Gu, and J. Jiang. 2020. Atmfn: Adaptive-threshold-based multi-model fusion network for compressed face hallucination. IEEE Trans. Multimedia 22, 10 (2020), 2734–2747.
[115]
Y. Song, J. Zhang, S. He, L. Bao, and Q. Yang. 2017. Learning to hallucinate face images via component generation and enhancement. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17). 4537–4543.
[116]
J. Jiang, Y. Yu, J. Hu, S. Tang, and J. Ma. 2018. Deep CNN denoiser and multi-layer neighbor component embedding for face hallucination. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI’18). 771–778.
[117]
C. Chen, X. Li, L. Yang, X. Lin, and K. Wong. 2021. Progressive semantic-aware style transformation for blind face restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11896–11905.
[118]
X. Yu, L. Zhang, and W. Xie. 2021. Semantic-driven face hallucination based on residual network. IEEE Trans. Biom. Behav. Identity Sci. 3, 2 (2021), 214–228.
[119]
C. Wang, Z. Zhong, J. Jiang, D. Zhai, and X. Liu. 2020. Parsing map guided multi-scale attention network for face hallucination. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 2518–2522.
[120]
X. Hu, W. Ren, J. Lamaster, X. Cao, X. Li, Z. Li, B. Menze, and W. Liu. 2020. Face super-resolution guided by 3D facial priors. In Proceedings of the European Conference on Computer Vision (ECCV’20). Springer International Publishing, 763–780.
[121]
S. Zhu, S. Liu, C. L. Chen, and X. Tang. 2016. Deep cascaded bi-network for face hallucination. In Proceedings of the European Conference on Computer Vision Workshops (ECCV’16). Springer International Publishing, 614–630.
[122]
K. Li, B. Bare, B. Yan, B. Feng, and C. Yao. 2018. Face hallucination based on key parts enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’18). IEEE, 1378–1382.
[123]
Y. Yin, J. P. Robinson, Y. Zhang, and Y. Fu. 2020. Joint super-resolution and alignment of tiny faces. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’19). 12693–12700.
[124]
M. Li, Z. Zhang, J. Yu, and C. W. Chen. 2021. Learning face image super-resolution through facial semantic attribute transformation and self-attentive structure enhancement. IEEE Trans. Multimedia 23 (2021), 468–483.
[125]
X. Yu, B. Fernando, B. Ghanem, F. Porikli, and R. Hartley. 2018. Face super-resolution guided by facial component heatmaps. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer International Publishing, 219–235.
[126]
A. Bulat and G. Tzimiropoulos. 2018. Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18). IEEE, 109–117.
[127]
D. Kim, M. Kim, G. Kwon, and D. S. Kim. 2019. Progressive face super-resolution via attention to facial landmark. In Proceedings of the British Machine Vision Conference (BMVC’19). BMVA, 1–12.
[128]
T. Zhao and C. Zhang. 2020. SAAN: Semantic attention adaptation network for face super-resolution. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’20). IEEE, 1–6.
[129]
C. Wang, J. Jiang, and X. Liu. 2021. Heatmap-aware pyramid face hallucination. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’21). IEEE, 1–6.
[130]
J. Li, B. Bare, S. Zhou, B. Yan, and K. Li. 2021. Organ-branched CNN for robust face super-resolution. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’21). IEEE, 1–6.
[131]
Z. S. Liu, W. C. Siu, and Y. L. Chan. 2021. Features guided face super-resolution via hybrid model of deep learning and random forests. IEEE Trans. Image Process. 30 (2021), 4157–4170.
[132]
M. Li, Y. Sun, Z. Zhang, and J. Yu. 2018. A coarse-to-fine face hallucination method by exploiting facial prior knowledge. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP’18). IEEE, 61–65.
[133]
Y. Zhang, Y. Wu, and L. Chen. 2020. MSFSR: A multi-stage face super-resolution with accurate facial representation via enhanced facial boundaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’20). IEEE, 2120–2129.
[134]
H. Wang, Q. Hu, C. Wu, J. Chi, X. Yu, and H. Wu. 2021. Dclnet: Dual closed-loop networks for face super-resolution. Knowl.-Bas. Syst. 222 (2021), 106987.
[135]
S. Liu, C. Xiong, X. Shi, and Z. Gao. 2021. Progressive face super-resolution with cascaded recurrent convolutional network. Neurocomputing 449 (2021), 357–367.
[136]
S. Liu, C. Xiong, and Z. Gao. 2021. Face super-resolution network with incremental enhancement of facial parsing information. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR’21). IEEE, 7537–7543.
[137]
L. Li, J. Tang, Z. Ye, B. Sheng, L. Mao, and L. Ma. 2021. Unsupervised face super-resolution via gradient enhancement and semantic guidance. Vis. Comput. 37, 9–11 (2021), 2855–2867.
[138]
Y. Lu, Y. W. Tai, and C. K. Tang. 2018. Attribute-guided face generation using conditional CycleGAN. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer International Publishing, 293–308.
[139]
X. Yu, B. Fernando, R. Hartley, and F. Porikli. 2018. Super-resolving very low-resolution face images with supplementary attributes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18). IEEE, 908–917.
[140]
M. Li, Y. Sun, Z. Zhang, H. Xie, and J. Yu. 2019. Deep learning face hallucination via attributes transfer and enhancement. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’19). IEEE, 604–609.
[141]
C. H. Lee, K. Zhang, H. C. Lee, C. W. Cheng, and W. Hsu. 2018. Attribute augmented convolutional neural network for face hallucination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’18). IEEE, 721–729.
[142]
X. Yu, B. Fernando, R. Hartley, and F. Porikli. 2020. Semantic face hallucination: Super-resolving very low-resolution face images with supplementary attributes. IEEE Trans. Pattern Anal. Mach. Intell. 42, 11 (2020), 2926–2943.
[143]
J. Xin, N. Wang, X. Gao, and J. Li. 2019. Residual attribute attention network for face image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’19). 9054–9061.
[144]
J. Xin, N. Wang, X. Jiang, J. Li, X. Gao, and Z. Li. 2020. Facial attribute capsules for noise face super resolution. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’20). 12476–12483.
[145]
K. Zhang, Z. Zhang, C. W. Cheng, W. H. Hsu, Y. Qiao, W. Liu, and T. Zhang. 2018. Super-identity convolutional neural network for face hallucination. In Proceedings of the European Conference on Computer Vision (ECCV’18). 183–198.
[146]
B. Bayramli, U. Ali, T. Qi, and H. Lu. 2019. FH-GAN: Face hallucination and recognition using generative adversarial network. In Neural Information Processing. Springer International Publishing, 3–15.
[147]
H. Huang, R. He, Z. Sun, and T. Tan. 2019. Wavelet domain generative adversarial network for multi-scale face hallucination. Int. J. Comput. Vis. 127, 6–7 (2019), 763–784.
[148]
S. Lai, C. He, and K. Lam. 2019. Low-resolution face recognition based on identity-preserved face hallucination. In Proceedings of the IEEE International Conference on Image Processing (ICIP’19). IEEE, 1173–1177.
[149]
X. Cheng, J. Lu, B. Yuan, and J. Zhou. 2020. Identity-preserving face hallucination via deep reinforcement learning. IEEE Trans. Circ. Syst. Vid. Technol.4796–4809.
[150]
K. Grm, W. J. Scheirer, and V. Štruc. 2020. Face hallucination using cascaded super-resolution and identity priors. IEEE Trans. Image Process. 29 (2020), 2150–2165.
[151]
A. A. Abello and R. Hirata. 2019. Optimizing super resolution for face recognition. In Proceedings of the 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI’19). IEEE, 194–201.
[152]
F. Cheng, T. Lu, Y. Wang, and Y. Zhang. 2021. Face super-resolution through dual-identity constraint. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’21). IEEE, 1–6.
[153]
J. Kim, G. Li, I. Yun, C. Jung, and J. Kim. 2021. Edge and identity preserving network for face super-resolution. Neurocomputing 446 (2021), 11–22.
[154]
E. Ataer-Cansizoglu, M. Jones, Z. Zhang, and A. Sullivan. 1903. Verification of very low-resolution faces using an identity-preserving deep face super-resolution network. arXiv:1903.10974. Retrieved from https://arxiv.org/abs/1903.10974.
[155]
J. Chen, J. Chen, Z. Wang, C. Liang, and C. W. Lin. 2020. Identity-aware face super-resolution for low-resolution face recognition. IEEE Sign. Process Lett. 27 (2020), 645–649.
[156]
W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. 2017. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). IEEE,212–220.
[157]
C. Hsu, C. Lin, W. Su, and G. Cheung. 2019. Sigan: Siamese generative adversarial network for identity-preserving face hallucination. IEEE Trans. Image Process. 28, 12 (2019), 6225–6236.
[158]
H. Kazemi, F. Taherkhani, and N. M. Nasrabadi. 2019. Identity-aware deep face hallucination via adversarial face verification. In Proceedings of the IEEE 10th International Conference on Biometrics Theory, Applications and Systems (BTAS’19). IEEE, 1–10.
[159]
B. Dogan, S. Gu, and R. Timofte. 2019. Exemplar guided face image super-resolution without facial landmarks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’19). IEEE, 1814–1823.
[160]
K. Wang, J. Oramas, and T. Tuytelaars. 2021. Multiple exemplars-based hallucination for face super-resolution and editing. In Proceedings of the Asia Conference on Computer Vision (ACCV’20).Springer International Publishing, 258–273.
[161]
X. Li, G. Duan, Z. Wang, J. Ren, Y. Zhang, J. Zhang, and K. Song. 2019. Recovering extremely degraded faces by joint super-resolution and facial composite. In International Conference on Tools with Artificial Intelligence (ICTAI’19). 524–530.
[162]
S. Schaefer, T. Mcphail, and J. Warren. 2006. Image deformation using moving least squares. In ACM Papers on SIGGRAPH’06. ACM Press, 533–540.
[163]
X. Huang and S. Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). IEEE, 1501–1510.
[164]
T. Baltrusaitis, P. Robinson, and L. P. Morency. 2013. Constrained local neural fields for robust facial landmark detection in the wild. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW’13). IEEE, 354–361.
[165]
T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L. P. Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG’18). IEEE, 59–66.
[166]
A. Zadeh, C. L. Yao, T. Baltruaitis, and L. P. Morency. 2017. Convolutional experts constrained local model for 3D facial landmark detection. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW’17). IEEE, 2519–2528.
[167]
C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. 2018. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV’18). Springer International Publishing, 334–349.
[168]
Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. 2018. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV’18). 286–301.
[169]
Y. Mei, Y. Fan, and Y. Zhou. 2020. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 3517–3526.
[170]
L. Yang, B. Shao, T. Sun, S. Ding, and X. Zhang. 2018. Hallucinating very low-resolution and obscured face images. arXiv:1811.04645. Retrieved from https://arxiv.org/abs/1811.04645.
[171]
J. Cai, H. Hu, S. Shan, and X. Chen. 2020. FCSR-GAN: Joint face completion and super-resolution via multi-task learning. IEEE Trans. Biom. Behav. Identity Sci. 2, 2 (2020), 109–121.
[172]
Z. Liu, Y. Wu, L. Li, C. Zhang, and B. Wu. 2020. Joint face completion and super-resolution using multi-scale feature relation learning. arXiv:2003.00255. Retrieved from https://arxiv.org/abs/2003.00255.
[173]
Y. Zhang, X. Yu, X. Lu, and P. Liu. Pro-uigan: Progressive face hallucination from occluded thumbnails. arXiv:2108.00602. Retrieved from https://arxiv.org/abs/2108.00602.
[174]
X. Xu, D. Sun, J. Pan, Y. Zhang, H. Pfister, and M. H. Yang. 2017. Learning to super-resolve blurry face and text images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV’17). IEEE, 251–260.
[175]
Y. Song, J. Zhang, L. Gong, S. He, L. Bao, J. Pan, Q. Yang, and M. H. Yang. 2019. Joint face hallucination and deblurring via structure generation and detail enhancement. Int. J. Comput. Vis. 127, 6–7 (2019), 785–800.
[176]
C. H. Yang and L. W. Chang. 2020. Deblurring and super-resolution using deep gated fusion attention networks for face images. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’20). IEEE, 1623–1627.
[177]
Y. Xu, H. Zou, Y. Huang, L. Jin, and H. Ling. 2021. Super-resolving blurry face images with identity preservation. Pattern Recogn. Lett. 146 (2021), 158–164.
[178]
H. A. Le and I. A. Kakadiaris. 2019. SeLENet: A semi-supervised low light face enhancement method for mobile face unlock. In Proceedings of the International Conference on Biometrics (ICB’19). IEEE, 1–8.
[179]
X. Ding and R. Hu. 2020. Learning to see faces in the dark. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME’20). IEEE, 1–6.
[180]
Y. Zhang, T. Tsang, Y. Luo, C. Hu, X. Lu, and X. Yu. 2020. Copy and paste GAN: Face hallucination from shaded thumbnails. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 7355–7364.
[181]
Y. Zhang, I. Tsang, Y. Luo, C. Hu, X. Lu, and X. Yu. 2021. Recursive copy and paste GAN: Face hallucination from shaded thumbnails. IEEE Trans. Pattern Anal. Mach. Intell. IEEE, 1–1. DOI:https://doi.org/10.1109/TPAMI.2021.3061312.
[182]
R. Yasarla, H. Joze, and V. M. Patel. 2021. Network architecture search for face enhancement. arXiv:2105.06528. Retrieved from https://arxiv.org/abs/2105.06528.
[183]
M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. 2015. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 116, 1 (2015), 1–20.
[184]
X. Yu, F. Porikli, B. Fernando, and R. Hartley. 2020. Hallucinating unaligned face images by multiscale transformative discriminative networks. Int. J. Comput. Vis. 128, 2 (2020), 500–526.
[185]
X. Yu and F. Porikli. 2017. Hallucinating very low-resolution unaligned and noisy face images by transformative discriminative autoencoders. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’17). 3760–3768.
[186]
Y. Zhang, I. W. Tsang, J. Li, P. Liu, X. Lu, and X. Yu. 2021. Face hallucination with finishing touches. IEEE Trans. Image Process. 30 (2021), 1728–1743.
[187]
X. Yu, F. Shiri, B. Ghanem, and F. Porikli. 2020. Can we see more?: Joint frontalization and hallucination of unaligned tiny faces. IEEE Trans. Pattern Anal. Mach. Intell. 42, 9 (2020), 2148–2164.
[188]
X. Tu, J. Zhao, Q. Liu, W. Ai, G. Guo, Z. Li, W. Liu, and J. Feng. 2021. Joint face image restoration and frontalization for recognition (unpublished).
[189]
D. Li and Z. Wang. 2017. Face video super-resolution with identity guided generative adversarial networks. In Communications in Computer and Information Science. Springer, Singapore, 357–369.
[190]
E. Ataer-Cansizoglu and M. Jones. 2018. Super-resolution of very low-resolution faces from videos. In Proceedings of the British Machine Vision Conference (BMVC). BMVA Press, 1–13.
[191]
J. Xin, N. Wang, J. Li, X. Gao, and Z. Li. 2020. Video face super-resolution with motion-adaptive feedback cell. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’20). AAAI, 12468–12475.
[192]
C. Fang, G. Li, X. Han, and Y. Yu. 2020. Self-enhanced convolutional network for facial video hallucination. IEEE Trans. Image Process. 29 (2020), 3078–3090.
[193]
Y. Gan, Y. Luo, X. Yu, B. Zhang, and Y. Yang. 2021. Vidface: A full-transformer solver for video facehallucination with unaligned tiny snapshots. arXiv:2105.14954. Retrieved from https://arxiv.org/abs/2105.14954.
[194]
X. Zhang and X. Wu. 2021. Multi-modality deep restoration of extremely compressed face videos. arXiv:2107.05548. Retrieved from https://arxiv.org/abs/2107.05548.
[195]
Z. Wan, B. Zhang, D. Chen, P. Zhang, D. Chen, J. Liao, and F. Wen. 2020. Bringing old photos back to life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 2747–2757.
[196]
T. H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik. 2019. Speech2face: Learning the face behind a voice. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19). 7539–7548.
[197]
G. Meishvili, S. Jenni, and P. Favaro. 2020. Learning to have an ear for face super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20). IEEE, 1364–1374.
[198]
G. Pan, S. Han, Z. Wu, and Y. Wang. 2006. Super-resolution of 3D face. In Proceedings of the European Conference on Computer Vision (ECCV’06). Springer, Berlin, 389–401.
[199]
S. Berretti, A. Del Bimbo, and P. Pala. 2012. Superfaces: A super-resolution model for 3D faces. In Proceedings of the European Conference on Computer Vision (ECCV’12).Springer, Berlin, 73–82.
[200]
L. Shu, I. Kemelmacher-Shlizerman, and L. G. Shapiro. 2014. 3D Face Hallucination from a Single Depth Frame. 31–38.
[201]
C. Qu, C. Herrmann, E. Monari, T. Schuchert, and J. Beyerer. 2017. Robust 3D patch-based face hallucination. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV’17). IEEE, 1105–1114.
[202]
J. Li, F. Zhu, X. Yang, and Q. Zhao. 2021. 3D face point cloud super-resolution network. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB’21). IEEE, 1–8.
[203]
H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao. 2021. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21). 12299–12310.
[204]
J. Liang, J. Cao, G. Sun, K. Zhang, Van G. L., and R. Timofte. 2021. SwinIR: Image restoration using swin transformer In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops’21). IEEE, 1833–1844.
[205]
S. Goswami and A. N. Rajagopalan. 2020. Robust super-resolution of real faces using smooth features. In Proceedings of the European Conference on Computer Vision (ECCV’20). Springer International Publishing, 169–185.
[206]
A. Aakerberg, K. Nasrollahi, and T. B. Moeslund. 2021. Real-world super-resolution of face-images from surveillance cameras. 2102.03113. Retrieved from https://arxiv.org/abs/2102.03113.
[207]
J. Jiang, C. Wang, X. Liu, and J. Ma. 2021. Spectral splitting and aggregation network for hyperspectral face super-resolution. arXiv:2108.13584. Retrieved from https://arxiv.org/abs/2108.13584.
