1 Introduction
In the past few decades, much attention has been paid to the use of 3D data in facial image processing applications. This technology has proven promising for robust facial feature extraction [
10,
51,
189]. In uncontrolled environments, it limits the effects of adverse factors such as unfavorable illumination conditions and the non-frontal poses of the face with respect to the camera [
51,
148,
176].
Among the various scenarios, developing personal recognition based on 3D data appears to be a “hot topic” due to the accuracy and efficiency obtainable from comparing faces, thanks to the complementary information of shape and texture [
12,
16,
97]. However, acquiring such data requires expensive hardware; moreover, the enrollment process is much more complex [
143,
148,
184,
219,
225]. Thus, face recognition technology was mainly developed in the 2D domain. The acquisition of 2D images is more straightforward than that of 3D ones, as it does not require specific hardware, but often makes the recognition task challenging due to the significant variability in facial appearance [
35,
148].
3D face reconstruction (3DFR) from 2D images and videos may overcome these limits, combining the ease of acquiring 2D data with the robustness of 3D ones (Figure
1).
Forensics is one field that could benefit from these characteristics, since it often deals with probe images of unidentified people’s faces captured in non-frontal view, in uncontrolled environments, and without the subject’s cooperation, such as those captured by CCTV (Closed-Circuit Television) cameras. Although some frameworks for the acquisition of 3D face models of suspects have been proposed (e.g., Reference [126]), in this context it is still common to have 2D mugshots, that is, frontal and, usually, profile images of subjects routinely captured by law enforcement agencies [
131] for the recognition of people of interest, such as suspects or witnesses (Figure
2).
Unfortunately, a reference gallery composed of frontal and profile images is not able to provide effective coverage of all possible conditions, such as in the case of a probe image in an arbitrary pose that is not at the same view angle as in one of the available mugshot images [
230]. Therefore, from the first attempt at face recognition from mugshots [
210], 3D reconstruction techniques have been exploited to address some of the issues typical of the considered forensic cases, in which the identity of unknown individuals must be established against a reference dataset of known individuals, either in verification mode (1 to 1) or in identification mode (1 to N). Hence, the research community proposed employing this approach in facial recognition from probe videos and images acquired in unconstrained environments, providing more information about individual faces through the generation of multiple views or the “correction” of the pose in probe data. This makes the comparison with reference data more robust to the appearance variations typical of forensic cases.
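The difference between the two modes can be illustrated with a minimal sketch. It assumes that each face is already described by a numeric feature vector and that cosine similarity with a fixed threshold is used for comparison; the vectors, the threshold value, and the function names are hypothetical and are not taken from any of the reviewed systems.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(probe, reference, threshold=0.8):
    """Verification (1 to 1): compare the probe with one claimed identity."""
    return cosine_similarity(probe, reference) >= threshold

def identify(probe, gallery):
    """Identification (1 to N): rank every gallery identity by similarity."""
    scores = {name: cosine_similarity(probe, emb) for name, emb in gallery.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy, hand-made feature vectors (hypothetical values, for illustration only).
gallery = {
    "subject_A": [1.0, 0.0, 0.0],
    "subject_B": [0.0, 1.0, 0.0],
    "subject_C": [0.6, 0.8, 0.0],
}
probe = [0.9, 0.1, 0.0]

ranking = identify(probe, gallery)               # candidates, best match first
is_match = verify(probe, gallery["subject_A"])   # single yes/no decision
```

In identification mode the output is a ranked candidate list that a human examiner would still have to review, whereas verification mode yields a single decision that depends on the chosen threshold.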
In particular, to be suitable for real-world forensic applications, any system of this kind should satisfy strict constraints leading to the legal validity of the conclusions during a lawsuit or in the investigation phase [
27,
110]. For this reason, it is necessary to analyze the methods that employ 3DFR to shed some light on their admissibility in the forensic scenario. Although other authors investigated the state-of-the-art of 3DFR from 2D images or videos [
61,
73,
148,
234] and its applications to face recognition [
61,
148,
156], none of them considered the requirements that these methods must satisfy to be employed in such a context or how forensics can benefit from their adoption. Moreover, the validity of the proposed face recognition systems in the considered application scenarios strongly depends on the datasets on which they have been evaluated, since these provide a basis for measuring and comparing their performance with the state of the art. In other words, data representativeness is fundamental, and the algorithms’ adoption is bounded by the available data [
40,
174].
A specific investigation highlighting the potential and limits of 3D facial reconstruction in forensics is still missing and, in our opinion, is necessary to direct research toward real-world application. To pursue this goal, this work analyzes the potential of 3D face reconstruction in forensics and the approaches proposed by the research community for integrating it into common face recognition casework, while considering the core challenges of the legal admissibility of automated systems that include it. The central aim of this work is to shed light on the requirements that should be satisfied to fill the gap between biometric recognition and forensic comparison when reconstructing a facial image in 3D space for the recognition of an individual from 2D videos or images, and thereby to investigate the potential benefit of this technique to forensics.
This article is the follow-up of Reference [
123], which is a first step toward the objectives listed above. To our knowledge, it represents the first investigation focused on the state of the art in the applications and potential of 3D face reconstruction in forensics and the novelties introduced to date (Figure
3), as well as the requirements that any of the related systems must satisfy to be considered admissible in criminal investigations or judicial cases. With respect to Reference [
123], this article extends that discussion, especially regarding the comparison among the proposed methods and the admissibility constraints that must be satisfied for them to be effectively integrated into the reference scenario. Moreover, this survey provides an analysis of the datasets employed in the reviewed studies, highlighting their strengths and limits and suggesting their uses, and potential issues, in the design and evaluation of forensic facial recognition algorithms. Finally, some state-of-the-art datasets that could be alternative or complementary to those already used are proposed and analyzed to provide suitable ground truth for future studies, with the main focus on the types of data considered so far, namely, facial images, videos, and 3D scans of the face.
The article’s structure is as follows: Section
2 analyzes the relationship between forensics and biometrics, mainly focusing on facial traits and the integration of 3DFR. The state-of-the-art assessment of 3DFR methods for face recognition from mugshot images is reported in Section
3. A review of other proposed forensic-related applications of 3DFR from facial images and videos is carried out in Section
4. Section
5 explores the underlying datasets of facial images, videos, and 3D scans, proposing others that could be suitable as well for future research on the analyzed topic. Finally, Section
6 discusses how all the aspects above converge in a unified view.
2 Face Recognition and Forensics
The face represents a valuable clue in many criminal investigations due to its advantageous characteristics with respect to other biometrics [
109,
164] and the growing number of surveillance cameras in both private and public places [
52,
102,
140]. Over the years, various methods have been proposed to check whether the individual’s identity in a probe image or video matches that of a person of interest, namely, an individual related to the event under investigation, such as a suspect, a victim, or a witness. In particular, these represent a subset of the approaches widely explored in traditional biometric recognition and implemented in the related automated face recognition systems [
109,
120,
185]. These methods can be summarized into various qualitative or quantitative examination approaches, which can be employed or are preferred under different conditions [
60,
62].
A first approach processes the face globally in a holistic form. However, it is recommended only if other more effective approaches are not suitable, and it is highly inaccurate when faces belong to unfamiliar people, in the case of partially occluded faces [
32,
62,
64,
216,
226] or severely distorted CCTV footage [
34].
A second approach is based on a set of facial fiducial points named landmarks [
28,
49] that are used to derive the distances and proportions between facial features. This approach is generally not recommended either, because manual landmark estimation in uncontrolled images is subjective and is affected by adverse factors such as large head poses, the distance from the camera, facial expressions, and lighting conditions [
62,
118,
150,
151,
208]. Some of these issues could be mitigated by means of preprocessing techniques (e.g., super-resolution methods [
101]).
A third approach is that of superimposition. It allows handling the discrepancies arising from differences in the position of the face with respect to the camera in two different aligned images or videos. To achieve this goal, it combines them through various methods, such as a reduced opacity overlay or blinking quickly between them. This approach has proven unreliable when comparing data acquired in uncontrolled scenarios, including in previous judicial cases [
5,
24,
62,
137,
150,
192,
193,
226].
A fourth approach is that of morphological comparison, in which a generally predefined list of facial regions and features extracted from them related to shape, appearance, presence, and/or location, such as the relative width of the mouth with respect to the distance between the eyes and the asymmetry of the mouth [
83], are compared to determine differences and similarities between the probe and reference data [
226].
In particular, the latter approach can improve examiners’ identification accuracy, partly because the features it evaluates are physically more stable over time than many photoanthropometric and holistic features [
86,
151]. However, the stability of the evaluated features could also be affected by extrinsic factors, such as lighting and the position of the subject’s face with respect to the camera, which can introduce different levels of variability, contributing to the unreliability of certain features [
119,
151,
224].
Despite the differences in reliability and acceptance, these approaches are not alternatives to each other. The choice among them is generally dependent on the probe image or videos, and they can even be used jointly in the identification task to carry out a more exhaustive analysis [
15,
108,
151]. Furthermore, even if these approaches could not be used as evidence in a confirmatory identification due to the acquisition condition of the probe image or video, these could still be employed in an attempt to exclude possible suspects or be a limited—but not worthless—support for reaching a conclusion through other evidence [
84,
119,
137,
151].
Although both biometric recognition and forensic identification seek to link evidence to a particular individual [
112], research in these fields has been pursued independently for many years due to their different goals and requirements, as well as the difficulties in achieving significant scientific contributions in this cross-domain research field [
123]. Thus, despite the employment of approaches that are common between them, the underlying methods and the automated systems integrating them must satisfy strict constraints to be considered suitable for forensic casework.
2.1 Automated Forensic Facial Recognition: The Italian Case
Due to the stringent requirements of the analyzed field, automatic recognition systems have only recently been introduced. For example, in 2017, the Italian police forces adopted for routine use an automatic image recognition system,
S.A.R.I. (from the Italian
“Sistema Automatico di Riconoscimento delle Immagini”), as an innovative tool aimed at supporting investigative activities [
17,
173]. This system automatically compares a facial probe image with millions of mugshots to reduce the number of candidates, which are then ordered by degree of similarity. Furthermore, the system can also work in real time on a gallery on the order of hundreds of thousands of individuals to enforce security and control of the territory. The outcome of S.A.R.I. is a set of potential candidates that must be examined by the specialized experts of the scientific police in charge of verifying the process [
22,
173]. Despite the effectiveness and the extreme speed of this automatic system, it cannot yet be used in the criminal field, as it does not allow access and repetition of recognition by the defense, thus precluding cross-examination of the specific functioning of the software in question [
38,
70,
173,
179]. Moreover, its functioning lacks the transparency required for any criminal case, thus precluding its compatibility with the constitutional procedural guarantees granted to the suspect [
173,
179].
If this is the state of things in face recognition, then what about 3D face reconstruction? 3D reconstruction is already employed for enhancing the views of crime scenes (e.g., Reference [
142]), as computer-generated evidence [
22]. Thanks to the 3D representation of the scene, obtained from one or more reference photographs, it is possible to recreate the intended scenario, for example, by inserting moving objects and simulating people’s behavior while respecting physical laws. However, depending on the task for which it is employed, this technology could be considered inadmissible due to the still experimental nature of the underlying method [
22]. Furthermore, the accuracy of the reconstruction of the human body is low, and that of the face is strongly influenced by the definition of the reference images as well as by the subjectivity of the operator in positioning the characterizing points for the reconstruction [
141].
Therefore, a fully automated 3D reconstruction such as the one integrated into biometric systems could reduce the errors caused by the operator, standardize the process, and speed up the analysis, provided that sufficient quality of the resulting 3D model can be guaranteed. These advantages led the research community to propose methods and approaches strictly focused on reconstructing the body or even single parts, such as the face. In particular, the 3D reconstruction of the face could be crucial for some forensic recognition tasks, strongly enhancing the recognition accuracy with respect to recognition from raw images, especially for faces in non-frontal poses. This technology could even be integrated into the previously cited scene reconstruction technology to strengthen the reliability of the related computer-generated evidence and make it employable for real recognition tasks.
These factors could be crucial in the introduction of this technology in the forensic recognition task. However, it must comply with the technical and admissibility requirements, summarized in Figure
4 and discussed in the following subsections, which any system must satisfy to be considered suitable to be employed in such a field.
2.2 Biometric Systems and Forensic Admissibility
Techniques and systems designed for biometrics, especially the automated ones, are appealing for their potential to address some forensic domain’s problems concerning crime prevention, crime investigation, and judicial trials in a more efficient, “scientifically objective,” and standardized way [
15,
112,
149,
162,
176,
190,
220]. In the case of face recognition, the related recognition technology has a role in many forensic and security applications, such as in identifying people of interest (e.g., terrorists) and searching for missing people, even in real-time [
71,
75,
99]. In particular, concepts behind biometric facial recognition could be beneficial in various tasks underlying forensic applications. For example, person re-identification and face identification could aid the search task of forensic practitioners, that is, the collection of evidence from crime scene images acquired from surveillance cameras [197], and the investigation task, that is, linking traces between crime scenes by generating and testing likely explanations [197]. Face recognition could also aid the individualization (or forensic evaluation) step, in which the evidential value is computed and assigned to the collected traces [
197], with noticeable parallelism with the similarity scores assigned by most automated face recognition systems in biometric recognition tasks.
However, although several groups, such as the FISWG (Facial Identification Scientific Working Group) [9] and the ENFSI (European Network of Forensic Science Institutes) [4], are currently working in this direction, there is no standardized and validated method in forensics [
15,
146,
149]. For example, in the United States, the admissibility of scientific evidence obtained through face recognition is generally evaluated through two guidelines:
– the “Frye’s rule” gives the judges the task of assessing whether the technique or technology is accepted in a relevant scientific community [1];
– the “Daubert’s rule” adds to the previous one the constraints that the technique has been tested, that a description of its error rate is available, and that it is maintained and adheres to standards [2, 6, 7, 15, 63].
In many other judicial systems beyond the U.S.A., no specific admissibility rule regarding the evaluation of scientific evidence is given, as in the case of the European judicial system, where the judges are generally responsible for its assessment in single cases [
15]. Another issue is the general acceptance of the biometric itself, especially the face, to the point that some governments banned or limited its usage even in law enforcement agencies (e.g., References [
48,
76,
136,
145,
218]). The concern is particularly related to positive identification due to the huge consequences of a false match in forensic cases combined with previous failures of face recognition systems in that direction [
31,
100].
Therefore, a robust and transparent methodology must be given for forensic recognition, the effectiveness of which has to be quantitatively assessable in statistical and probabilistic terms. The goal is to provide guidelines for quantifying biometric evidence value and its strength based on assumptions, operating conditions, and the casework’s implicit uncertainty [
72,
136,
197]. Besides, a set of interpretation methods must be defined independently of the baseline biometric system and integrated into the considered algorithm [
153,
197]. This allows reaching conclusions in court trials in agreement with three constraints (Figure
5): performance evaluation, understandability, and forensic evaluation [
27,
54,
110]. Closely related to these constraints, the
quality of the probe and reference data should also be considered in the admissibility assessment [
27,
176,
220].
2.2.1 Performance Evaluation.
Performance evaluation concerns the basic trust level of the system and its performance for a specific purpose; therefore, it supports the forensic practitioner’s decision when using such a system to perform a given task. For instance, a biometric system could be considered suitable for a specific task whenever it is tested and achieves a performance acknowledged as “good” on data representative of the working system’s context. For example, a face recognition system designed to perform well on high-resolution frontal images is not required to achieve the same performance on images acquired by CCTV cameras with random head poses [
27]. In a statistical evaluation, the definition of “good performance” depends on the context, the data, and the end-users’ requirements set in the design process. The performance parameters are different according to the system itself and the specific task for which it should be employed. For example, the accuracy, namely, the percentage of correctly classified samples [
113], could be considered in evaluating the performance of classification problems, such as in the case of the face recognition task. Distance-based metrics can instead be used for evaluating the error between the predicted values and the real ones in regression problems such as the 3D reconstruction tasks. An example of the latter is the
Root Mean Square Error (RMSE), which considers the distance between a reconstructed facial part and the corresponding ground truth in terms of pixels (e.g., Reference [
229]). As previously mentioned, understanding the metrics employed requires basic statistics knowledge, which legal decision-makers often do not have. This makes it difficult to justify the use of a particular system by such metrics in a law court [
27]. Thus, a certain level of confidence in the underlying technical aspects is necessary to interpret the performance parameters adopted.
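As a minimal sketch of the two kinds of metrics mentioned above, the following snippet computes the accuracy of a set of classification decisions and an RMSE between predicted and ground-truth points. It assumes one common formulation of the RMSE (the root of the mean squared Euclidean distance between corresponding points); the labels and pixel coordinates are hypothetical values chosen only for illustration.

```python
import math

def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def rmse(predicted_points, ground_truth_points):
    """Root of the mean squared Euclidean distance between corresponding points."""
    squared = [
        sum((p - g) ** 2 for p, g in zip(pred, gt))
        for pred, gt in zip(predicted_points, ground_truth_points)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Classification example: 3 of the 4 identities are predicted correctly.
acc = accuracy(["A", "B", "A", "C"], ["A", "B", "C", "C"])

# Regression example: two 2D points in pixel coordinates (hypothetical values).
err = rmse([(10.0, 10.0), (20.0, 24.0)], [(10.0, 10.0), (23.0, 20.0)])
```

Here the first point coincides with its ground truth and the second one lies 5 pixels away, so a single aggregate value hides how the error is distributed, which is one reason why such figures need careful interpretation in court.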
Another issue for the trust of biometric recognition systems in forensics is that of biased performance against certain demographic groups, meaning that the performance parameters may depend, on average, on the demographic groups present in the system’s dataset [
110,
212]. For example, biased performance on age, gender, and ethnic groups was recently reported [
33,
110,
166]. In face recognition, bias is a severe problem, since facial regions contain rich information strictly correlated to many demographic attributes, which could lead to biased performance [
117]. This issue has often been overlooked when face recognition systems were employed by law enforcement agencies [
82]. Thus, failing to analyze this aspect, or to evaluate the system on the demographic group representative of the casework, could lead to the inadmissibility of the biometric system in judicial trials or, simply, to unreliable support for the human expert’s decision. In other words, the choice of the datasets employed for training and evaluating the system is one of the factors that must be considered in the performance evaluation [
108]. Furthermore, fairness, interpretability, and even performance could benefit from the ability of a system to provide information about how biased its decision could be [
72,
157] (see also Section
2.2.2).
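A simple way to make such a dependence visible is to report the performance separately for each demographic group rather than as a single aggregate. The sketch below does this for accuracy; the group names, the labels, and the use of the largest accuracy gap as a summary are illustrative assumptions, not a method proposed in the cited works.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, true_label, predicted_label in records:
        totals[group] += 1
        correct[group] += int(true_label == predicted_label)
    return {group: correct[group] / totals[group] for group in totals}

# Hypothetical evaluation results for two demographic groups.
records = [
    ("group_1", "A", "A"), ("group_1", "B", "B"),
    ("group_1", "C", "C"), ("group_1", "D", "A"),
    ("group_2", "E", "E"), ("group_2", "F", "A"),
    ("group_2", "G", "B"), ("group_2", "H", "H"),
]

per_group = accuracy_by_group(records)
# Difference between the best- and worst-served group.
gap = max(per_group.values()) - min(per_group.values())
```

In this toy example the overall accuracy is 0.625, yet the two groups differ by 0.25, a disparity that the aggregate figure alone would not reveal.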
2.2.2 Understandability.
Understandability (also known as interpretability [
27]) is the ability of a human to understand the functioning of a system, its purpose, its features, as well as its outcome and the (computational) steps that led to such a result. In particular, the understandability evaluation supports the decision of whether the outcome of the system is suitable. This is particularly relevant for legal decision-makers (e.g., judges) who are typically not experts in those topics [
11,
19,
27,
88,
110,
125].
A first step for making a system understandable is to design it as “explainable” in the decision-making process. This facilitates its traceability, which, in turn, could help prevent or deal with erroneous decisions by revealing the possible points of failure and the most appropriate data and architecture [
11,
47,
72]. The main difference between understandability and explainability is that the latter focuses on the system’s design [
27], while the former focuses on the end-user experience. Therefore, the system’s understandability requires an explainable design process.
A factor that can improve the system’s understandability is its transparency, meaning the ability of the forensic practitioner to have access to the details related to the functioning of such a system [
27]. For example, a fully open-source system is entirely transparent. However, full transparency does not by itself imply understandability, as in the case of image processing algorithms whose effects cannot be reversed. In other words, they cause a loss of details or an irreversible/random addition that could even impact the reproducibility required of any automated system for it to be employable by forensic practitioners in reaching conclusions [
139]. Moreover, even details about the algorithms and the implementation of very complex systems like neural networks could be insufficient for their understanding [
27].
Therefore, for both complex and black-box systems, such as those based on Artificial Intelligence, it should be necessary to add sufficient local and/or global interpretations through metrics and mechanisms [
11,
27,
72,
87,
125,
127]. For example, the forensic practitioner must be able to determine whether the system is using the face area instead of the background when computing the related outcome. Moreover, understandability is an aid for legal decision-makers in cases where both the prosecution and the defense of the suspect present contradictory results based on their own black-box systems [
110]. Some examples of approaches for enhancing the explainability and, in particular, spatial understandability in the context of face recognition are the extraction of features in different areas of the face [
222] and the use of model-agnostic methods (i.e., not tied to a particular type of system [
11,
87]) that visualize the salient areas that contribute to the similarity between pairs of faces [
194]. Other approaches are the estimation of the uncertainty of features through the analysis of the distributional representation in the feature space of each input facial image, therefore assessing the uncertainty through the variance of such distributions [
186] and the analysis of the effect of features in the resulting outcomes such as facial angle and non-facial elements [
55,
196]. However, black-box systems such as deep neural networks still lack the interpretability needed to be effectively employed in forensic processing. In particular, understanding what information is being encoded from the input image into deep face representations would also help address possible biases of the system (e.g., toward a demographic group) [
110,
156].
2.2.3 Forensic Evaluation.
Forensic evaluation is the assignment of a relative plausibility of information over a set of competing hypotheses (or “propositions”) [
27]. It supports the forensic practitioner’s opinion regarding the level of confidence and the weight (i.e., the strength) of evidence when the system makes a decision according to its outcome [
27,
108,
125]. The system’s performance and understandability are taken into account in forensic evaluation, together with contextual information (e.g., additional cues or supporting evidence from other sources) and general knowledge, that is, additional information that could be either included in the decision process or formalized into the automated system itself [
27,
58,
111,
171]. Therefore, forensic evaluation includes the above elements to drive forensic practitioners toward an appropriate decision (e.g., identification) that could be either conclusive or inconclusive according to the assessed level of confidence [
27,
46,
56,
57,
58].
From a technical perspective, forensic evaluation is quantitatively given by a statistical approach based on the
likelihood ratio values (LR) [
146,
167,
176,
197,
209,
217]. In particular, it is acknowledged that the LR allows for a transparent, testable, and quantitative assessment of the probability assigned to the evidence of a face match by forensic practitioners, based on personal experience, experiments, and academic research, against the probability of a non-match [
27,
171]. A semi-quantitative scale could also be employed, in which values are aligned with ranges of likelihood ratios (e.g., weak/medium/strong), or one based on the relative strength of forensic observations in light of each proposition [
27,
39,
42]. Therefore, thanks to its transparency, testability, and formal correctness, LR allows the clear separation of responsibilities between the forensic examiner and the court. This makes it compliant with the requirements of evidence-based forensic science when quantifying the value of the evidence to the law court [
168,
175]. However, it must be remarked that calibrating a biometric score to become an LR requires a substantial amount of case-relevant data, that is, data representative of the analyzed scenario in terms of quality (see Section
2.2.4) and demographic group (see Section
2.2.1).
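The idea of turning a comparison score into an LR can be sketched as follows. The snippet assumes that same-source and different-source scores are each modeled by a Gaussian fitted on calibration data; this modeling choice, the score values, and the function names are illustrative assumptions, and real casework would require far more data and a validated calibration procedure.

```python
from statistics import NormalDist

def fit_score_model(scores):
    """Fit a Gaussian to a list of comparison scores (illustrative assumption)."""
    return NormalDist.from_samples(scores)

def likelihood_ratio(score, same_source_model, different_source_model):
    """LR = p(score | same source) / p(score | different sources)."""
    return same_source_model.pdf(score) / different_source_model.pdf(score)

# Hypothetical calibration scores; in practice these must be case-relevant.
same_source_scores = [0.82, 0.88, 0.79, 0.91, 0.85]
different_source_scores = [0.31, 0.42, 0.38, 0.27, 0.45]

same_model = fit_score_model(same_source_scores)
diff_model = fit_score_model(different_source_scores)

lr_high = likelihood_ratio(0.84, same_model, diff_model)  # score typical of matches
lr_low = likelihood_ratio(0.40, same_model, diff_model)   # score typical of non-matches
```

An LR above 1 supports the same-source proposition and an LR below 1 supports the different-source one. With so few calibration samples the fitted models produce extreme values, which illustrates why a substantial amount of representative data is needed.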
2.2.4 Quality Evaluation.
As previously pointed out, the characteristics of acquired data are also relevant. First, they should meet minimal requirements in terms of
quality [
176,
220]. Although not defined in a rigorous way, this term refers to factors that lead to blurriness, distortion, and artifacts in images. They may be caused by (1) the camera employed, whose sensor, optic, and analog-to-digital converter impact on the image resolution, the dynamic of gray-levels, its ability to focus on the target [
85,
108,
128,
137,
170,
187], (2) environmental conditions such as the illumination and the background of the scene, as well as the weather conditions (rainy/cloudy) [
105,
128,
170], (3) the subject’s distance from the camera, which adds scaling and out-of-focus problems, his/her camouflage to evade recognition (sunglasses, beard/mustache, hat/cap, makeup, jewelry), the speed and direction of the subject’s movement, and the position of the face with respect to the camera, which can lead to non-frontal views and incomplete data [
85,
102,
124,
137,
152,
213], and (4) the image processing embedded into the camera or applied right after the raw data acquisition, such as compression and resizing [
85,
108,
128,
170]. Therefore, quality must be evaluated for both probes and reference facial data to assess whether the proposed face recognition system is compliant with data of the kind [
10,
27,
105,
159]. Second, the amount of data is crucial for training and fine-tuning a new classification system [
105,
220] and for the calibration of LR frameworks for the evaluation of both new and existing systems (see Section 2.2.3), which has led to the creation of large-scale datasets for the evaluation of face recognition algorithms (e.g., Reference [
103]).
While the acquisition of mugshot images by law enforcement agencies is usually subject to strict control to ensure the truthful representation of appearance, this is often not the case with the acquisition of probe images and videos. Therefore, it is necessary to assess the quality of the available data to determine whether they can serve the intended biometric function, including the 3D reconstruction task and the subsequent recognition. The final goal is to employ the system’s outcome in the forensic investigation and in the subsequent judicial conclusion. To this end, quality evaluation would allow assessing the confidence level of decisions based on such data, or ranking and selecting the samples with the best quality (e.g., single frames from a surveillance video) [
98,
181,
198,
200]. To the best of our knowledge, a global standard for quality assessment is currently missing [
98,
181], probably also due to the human subjectivity factor in the task, and international standards are still under development (e.g., References [
106,
107]). However, a score based on the
Mean Opinion Scores (MOS) was proposed [
182] to justify the legal acceptance or the rejection of a potential probe image, video, or part of them. Unfortunately, the MOS method is often impractical, since it is considered slow, expensive, and, in general, inconvenient. Although other quality assessment methods have been proposed, most of them are not representative of human perception [
92,
214]. In our opinion, specific expertise in agreement with the law court process should be included (e.g., References [
182,
198]). Furthermore, quality measures about the “partial results” of the system should be integrated as well. For example, in the case of forensic recognition based on 3D face reconstruction from 2D images or videos, the 3D model reconstructed either from reference or probe data could be corrupted due to inaccurate localization of facial landmarks [
23], thus requiring the localization process to be repeated or even the sample to be discarded because it proves unusable. Therefore, the quality measures could be integrated into forensic recognition, considering them as complementary features [
74,
98,
163]. This means that quality assessment would pass through the previously described requirements (Sections
2.2.1–
2.2.2) to be admissible in the analyzed context [
181] according to the forensic evaluation process (Section
2.2.3).
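The ranking and selection of samples by quality mentioned above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the quality measure (variance of a discrete Laplacian used as a sharpness proxy), the function names, and the threshold are hypothetical and are not taken from the cited works, which also consider factors such as pose, resolution, and human perception.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Sharpness proxy (assumed measure): variance of the 4-neighbour
    Laplacian over the interior pixels of a grayscale image, given as a
    list of rows with at least 3 rows and 3 columns."""
    h, w = len(gray), len(gray[0])
    lap = [
        gray[r - 1][c] + gray[r + 1][c] + gray[r][c - 1] + gray[r][c + 1]
        - 4 * gray[r][c]
        for r in range(1, h - 1)
        for c in range(1, w - 1)
    ]
    return pvariance(lap)


def rank_frames(frames, min_quality=0.0):
    """Return (frame_index, score) pairs sorted from best to worst,
    discarding frames whose score falls below min_quality."""
    scored = [(i, laplacian_variance(f)) for i, f in enumerate(frames)]
    kept = [(i, s) for i, s in scored if s >= min_quality]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

In such a setup, the top-ranked frames would be passed to the reconstruction or recognition stage, while the score itself could be retained as a complementary feature for the subsequent evaluation.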
2.3 3D Face Reconstruction in Forensics
During the investigation phase, the subject’s identity is unknown, and the possible identities within a suspect reference set need to be rendered and sorted [
220] in terms of likelihood with respect to the evidence (e.g., a frame captured from a CCTV camera) [
14]. In addition to the classic challenges related to facial recognition in uncontrolled environments, such as low resolution, large poses, and occlusions [
89], forensic recognition faces even more challenges. Examples are the acquisition systems that are set up cheaply and subjects that actively try not to be captured by cameras, which exacerbate the previously cited issues and introduce novel problems such as heavy compression, distortions, and aberrations [
226]. Thanks to its greater representational power than 2D facial data, 3DFR can alleviate some of these problems. In fact, 3D data provides a representation of the facial geometry that reduces the adverse impact of non-optimal pose and illumination. Depending on the characteristics of the probe image and the reference set narrowed down by police and forensic investigation, whenever the investigator is required to compare these images, and it is necessary or advantageous to use an automatic face recognition system, 3DFR can be employed by following two different approaches, namely,
view-based and
model-based approaches (Figure
6), to improve the performance of facial recognition systems and, therefore, enhance their admissibility in legal trials.
In a view-based approach, the set of images containing frontal faces is adapted to non-frontal ones, and, thus, it is typically applied on the reference set to adapt the faces within mugshots to the probe image such that it matches the pose of the represented face [
160] (Figure
7). Although it allows comparing facial images under similar poses, this approach requires a reference set containing images of suspects captured in such a pose or synthesizing such a view through the 3D model of each suspect. In the latter case, each 3D model can be adapted after applying a pose estimation algorithm on the probe image before employing the actual recognition system [
59,
138,
228,
229]. Another proposed strategy is to introduce a gallery enlargement phase instead, which consists of projecting the 3D model in various predefined poses in the 2D domain to enhance the representation capability of each subject and then employing the synthesized images in the recognition task [
93,
131,
132,
230]. Overall, the view-based approach represents a suitable choice whenever multi-view face images of suspects are captured during enrollment for the purpose of highly accurate authentication, such as in the case of the verification task in face recognition [
93], although it usually involves higher computational cost in terms of both time and memory with respect to the model-based counterpart.
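The gallery enlargement strategy described above can be sketched as follows. This is a simplified illustration, assuming that the 3D model is reduced to a set of 3D points, that only yaw rotations are considered, and that an orthographic camera is used; the function names and the default angles are hypothetical. An actual system would render a textured mesh rather than project bare points.

```python
import math


def rotate_yaw(points, yaw_deg):
    """Rotate 3D points (x, y, z) about the vertical (y) axis."""
    a = math.radians(yaw_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]


def project_orthographic(points):
    """Drop the depth coordinate (assumed orthographic camera)."""
    return [(x, y) for x, y, _ in points]


def enlarge_gallery(model_points, yaw_angles=(-60, -30, 0, 30, 60)):
    """Map each predefined yaw angle to the 2D projection of the model,
    yielding one synthetic view per pose for a single enrolled subject."""
    return {
        yaw: project_orthographic(rotate_yaw(model_points, yaw))
        for yaw in yaw_angles
    }
```

The gallery adaptation strategy differs only in that a single angle, estimated from the probe image, would replace the predefined list, which is why its memory footprint is smaller.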
In a model-based approach, the adaptation phase is performed on non-frontal faces to synthesize a face in frontal view through the reconstructed 3D face [
93] (Figure
8). The normalized (or “frontalized”) face is then compared to the frontal faces within the gallery set to determine the subject’s identity in the probe image [
69,
95]. This approach is suitable for real-world scenarios in which it is necessary to seek the identity of an unknown person within a probe image or video in a large-scale mugshot dataset [
93], as in the so-called face identification task in biometric recognition, for maximizing the likelihood of returning the potential candidates. Despite the generally lower computational cost, this approach is only applicable when it is possible to synthesize good-quality frontal view images with the original texture, since it could provide complementary information for recognition with respect to the shape [
16,
134]. According to what we discussed in Section
2.2.4, the minimum quality requirements for the probe images must be met, which is not often the case in real forensic scenarios. Furthermore, it could be necessary to handle possible textural artifacts in the resulting frontal image [
36,
95,
233].
Hence, the application of a view-based approach would allow changing the scenario from a more traditional 2D-to-2D recognition to a 3D-to-2D recognition, in which the reconstructed 3D face representation is typically used to generate synthetic facial views matched with the probe image [
35]. This can be achieved by rotating the 3D model so that its pose matches the one in the compared image, possibly after applying similar lighting conditions to the model to ease the comparison (e.g., Reference [
211]). Similarly, a model-based approach could be exploited either for aiding the 2D-to-2D face recognition task, through the synthesis of non-frontal faces in the frontal view [
93], as is typically the case for probe images, or in the 2D-to-3D recognition scenario, where several synthetic views can provide a set of potential probe images [
206], in agreement with the reference ones. Coherently, these approaches would jointly allow a 3D-to-3D recognition scenario: The 3D representation of the face reconstructed from the reference images is compared with the one reconstructed from a probe video sequence [
35]. The view-based approach typically involves the reconstruction from mugshots and the model-based approach from probe images, mainly due to the typical qualitative characteristics of data. Nonetheless, it is still possible to employ these approaches on both sets of data, according to the specific task (e.g., it could be possible and convenient to apply a view-based approach on a surveillance video to ease the comparison). However, the potential bias towards the average geometry must be taken into account when reconstructing the 3D faces [
205], especially when the reconstruction is performed from single images.
3 3D Face Reconstruction for Mugshot-based Recognition
Although many attempts have been made in the past years to reconstruct faces in the 3D domain, either from a single image or multiple images of the same subject [
148], only a few were evaluated for their potential applications in forensics. Among them, we want to focus on exploiting mugshot images captured by law enforcement agencies. The reason is that methods inspired by this approach are closer than others to satisfying the previously seen criteria for their potential admissibility in forensic cases.
To our knowledge, the earliest study on 3DFR from mugshot images for forensic recognition was proposed in 2008 by Zhang et al. [
230], who employed a view-based gallery enlargement approach to recognize probe face images in arbitrary view with the aid of a 3D face model for each subject reconstructed from mugshot images (Figure
9). To reconstruct the shape of such a model, they proposed a multilevel variation minimization approach that requires a set of landmarks specified on a pair of frontal-side views to be used as constraining points (i.e., eyes, eyebrows, nose profiles, lips, ears, and points interpolated between them [
232]). Finally, they recovered the corresponding facial texture through a photometric method. They evaluated their approach on the CMU PIE dataset [
188] using a holistic face comparator (or matcher) [
202] and a local one typically employed in biometrics for a textural classification [
13], restricting the rotation angles of the probe images to
\(\pm 70^{\circ }\). This analysis revealed a significant improvement in average recognition accuracy with respect to the original mugshot gallery, especially when the rotation angle of the face in the probe image is larger than 30
\(^{\circ }\). However, the limit of the rotation angle of faces in probe images and the use of traditional face comparators rather than state-of-the-art ones do not allow for assessing the actual improvement in the effectiveness of 3DFR from mugshot images in terms of forensic recognition [
93,
131]. Other drawbacks of the proposed method are the possible artifacts caused by the assumed model [
228] and the poorly explored image texture. Furthermore, they performed the analysis on a small-scale dataset containing only 68 subjects. Finally, despite improved performance and the usage of a local face comparator that enhances understandability [
222], which expresses the similarity between single facial parts rather than a global similarity and allows assessing the salient areas that led to the outcome of the system, the authors did not utilize any strategy for facilitating the forensic evaluation. Moreover, the analysis of local patterns could also help address the presence of occlusions. Another aspect that could be considered is the computational time required for the gallery enlargement, which appears to make the method unsuitable for applications having strict time constraints, even accounting for the age of the hardware on which it was tested (Table
1). We further discuss this factor in Section
6.
Four years later, Han and Jain [
93] proposed to employ the frontalization approach in the considered scenario, as it had already shown its effectiveness in the biometric recognition from non-frontal faces [
25]. They proposed a 3DFR method from a pair of frontal-profile views based on a
3D Morphable Model (3DMM) [
26], a generative model for realistic face shape and appearance, to aid the reconstruction process. They reconstructed the 3D face shape through the correspondence between landmarks within the frontal image and those on the profile one and extracted the texture by mapping the facial image to the 3D shape. A view-based gallery enlargement approach and model-based probe frontalization approach (Figure
10) were employed to enhance the performance through the proposed reconstruction approach. They evaluated them on subsets of PCSO [
3] and FERET [
161] datasets through a local face comparator and a commercial one, revealing an improved recognition accuracy in both cases. One of the most evident limits of the reconstruction approach in a forensic context is that the involved 3DMM is a global statistical model that is limited in recovering facial details [
148], as it could be dominated by the mean 3D face model, which potentially introduces a bias of the outcome towards the underlying model [
206]. This aspect could be further enforced by the relatively low quality of the employed images. Furthermore, the involved 3DMM could cause evident distortion when the model is largely rotated [
132,
229]. Other limits of this work are that the authors did not fully explore the texture and did not use state-of-the-art face comparators [
131,
228]. Therefore, as in the previous case, despite the improvement in performance and the enhanced understandability, thanks to local features, the authors did not employ any framework for easing the forensic evaluation of their method. Finally, no information about the computational time was reported.
In the same year, Dutta et al. [
59] proposed a method based on 3DFR for improving face recognition from non-frontal view images through a view-based gallery adaptation approach (Figure
11). They applied existing recognition systems to the 16 common subjects in the CMU PIE [
188] and Multi-PIE [
81] datasets, containing frontal and surveillance images, respectively. The adaptation of the reconstructed model to the pose estimated from a probe image could be particularly advantageous whenever poor-quality probe data were acquired, while it is possible to obtain the 3D model from images having a higher quality, such as in the case of mugshot images (Figure
11). However, this approach requires an accurate estimate of the pose of the face in the probe image. Furthermore, the small number of subjects involved in the study should be enlarged to simulate a forensic case and evaluate the improvement entity for assessing their applicability in real-case scenarios. Despite the advantages in some application contexts in terms of performance, the authors did not take into account understandability or forensic evaluation. The required computational time was not assessed as well.
Similarly, Zeng et al. [
228,
229] reconstructed 3D faces from 2D forensic mugshot images, employing frontal, left profile, and right profile reference images, through multiple reference models to obtain more accurate outcomes for enhancing recognition performance through a view-based gallery adaptation approach. To this aim, they used a coarse-to-fine 3D shape reconstruction approach based on the three views through a photometric method and multiple reference 3D face models. The use of multiple reference models is an attempt to limit the homogeneity of reconstructed 3D face shape models and increase the probability of finding the most similar candidate for the single parts of the input face. The so-reconstructed 3D face shapes were then used in the recognition task to establish correspondence between the local semantic patches around seven landmarks on the arbitrary view probe image and those on the gallery of mugshot face images, assuming that patches will deform according to the head pose angles. The authors [
228] tested their approach on the CMU PIE [
188] and Color FERET [
161] datasets. They showed that deforming semantic patches is effective [
13] and compared the performance with a commercial face recognition system [
154] and the previously described method proposed by Zhang et al. [
230]. The authors [
229] also evaluated the enhancement using a
machine learning (ML) classifier on different poses within the Bosphorus [
178] and Color FERET [
161] datasets. As the authors suggested, the improvement in recognition capability from face images in arbitrary poses is due to the greater robustness of semantic patches to pose variation and the higher inter-class variation introduced by the subject-specific 3D face model. A limitation of this work is the outdated face comparators involved [
131]. Furthermore, although the method employs multiple reference models, the outcome could still be biased toward them [
206]. Finally, despite the fact that the proposed method enhances the performance of an understandable recognition approach, thanks to the employed local recognition approach, the authors did not perform any forensic evaluation. Moreover, despite assessing the test time on a single probe image, the authors did not report the computational time required for the reconstruction of the models in the reference gallery nor for the training of the recognition system (Table
1).
In 2018, Liang et al. [
131] proposed an approach for arbitrary-view face recognition based on 3DFR from mugshot images that fully explores image texture. The proposed shape reconstruction approach is based on cascaded linear regression from 2D facial landmarks estimated in frontal and profile images. After reconstructing the 3D shape, they approached the texture recovery through a coarse-to-fine approach. Then, they employed the proposed method in a recognition task on a subset of images from each subject of the Multi-PIE dataset [
81] through a view-based gallery enlargement approach on state-of-the-art comparators based on
deep learning (DL). Furthermore, they compared the performance before and after the gallery enlargement and by fine-tuning the comparators with the generated multi-view images. The results highlighted improved recognition accuracy in large-pose images, especially with fine-tuned comparators. In particular, this method provides better results than the one proposed by Han and Jain [
93], probably because of the greater focus on reconstructing texture information [
131]. Hence, the most significant novelties introduced by this work are the textured full 3D faces reconstructed from the mugshot images and the analysis on DL-based comparators, inherently more robust to pose variations than traditional ones [
131]. Furthermore, they fine-tuned those comparators with the enlarged gallery, revealing even better performance than the previous gallery enlargement approaches. The authors also assessed the computational time required for the reconstruction of the 3D models, revealing a huge improvement with respect to the previous study reporting it, even accounting for the different capabilities of the hardware on which each method was tested (Table
1). Although the reconstruction method appears suitable for real-time applications [
131], the authors did not report the computational time required for training and testing the recognition system. A limit of the proposed method is that it does not consistently work across all pose directions, revealing worse performance for some poses than in the original gallery (e.g., in frontal pose). Furthermore, the evaluated performance could suffer from demographic bias due to the unbalanced demographic distribution related to the dataset employed in the experiments [
81]. Finally, the authors did not take into account any understandability or forensic evaluation.
In 2020, the same authors published an extension of this work [
132], in which they also proposed a DL-based shape reconstruction. In this work, the authors extended the evaluation of the face recognition capability of the proposed method based on linear shape reconstruction by employing a subset of the Color FERET dataset [
161], obtaining a higher recognition accuracy on average as in the case of the Multi-PIE dataset [
81]. Furthermore, they tried to solve the drawback of their previous work, related to worse recognition performance for some poses, with respect to usage of the original gallery, through a fusion between the similarity scores obtained by both the original mugshot images and the synthesized ones. The improvements previously observed by combining 2D images and 3D face models in multi-modal approaches [
10,
29,
30,
43,
44] were therefore confirmed. This approach, evaluated on the Multi-PIE dataset [
81], revealed consistently better performance on all the pose angles. With respect to their previous study, the authors also reported the computational time required for training the recognition system (Table
1). Despite the proposed novelties, the authors did not assess whether the proposed DL-based shape reconstruction approach is able to enhance recognition capability. Finally, the study did not consider understandability or forensic evaluation.
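The fusion between the similarity scores obtained from the original mugshot images and from the synthesized views can be illustrated with a minimal sketch. The fusion rule shown here (a weighted sum of the probe-versus-mugshot score and the best probe-versus-synthesized-view score), the use of cosine similarity, and all names and weights are assumptions made for illustration; they are not necessarily the rule adopted by the authors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fused_score(probe, mugshot_feat, synthesized_feats, weight=0.5):
    """Weighted-sum fusion (assumed rule) of two similarity scores:
    probe vs. original mugshot, and probe vs. the closest synthesized view."""
    s_orig = cosine_similarity(probe, mugshot_feat)
    s_synth = max(cosine_similarity(probe, f) for f in synthesized_feats)
    return weight * s_orig + (1.0 - weight) * s_synth
```

With such a rule, a frontal probe can still rely on the original mugshot even if the synthesized views are of lower quality, which is consistent with the motivation given above for introducing the fusion.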
A quantitative comparison among the previously reviewed methods would require the usage of the same face comparators and their evaluation on the same ground truth data through the same performance metrics, and this is often unfeasible due to many factors, such as which datasets were state of the art when each work was proposed. Similarly, a comparison in terms of computational time is not feasible due to both the unreported information about time complexity and the differences in the hardware on which the proposed methods were tested. However, a qualitative comparison is provided in Table
1 and then discussed in Section
6.
4 Other Applications of 3D Face Reconstruction in Forensics
In addition to recognition from mugshot images, 3DFR could represent a valuable aid in other forensic contexts to facilitate the recognition of a subject. An example is the search for missing persons. Taking into account such a scenario, Ferková et al. [
68] proposed a method that includes demographic information to improve the outcome of the reconstruction from a single frontal image and, at the same time, speed up the related computation. In particular, the method estimates the 3D shape of the missing person’s face by taking into account age, gender, and the similarity between the landmarks of the reference depth images and those previously annotated in the input image. Then, planar meshes are generated by triangulating between the input image and the depth image. The authors reported that their reconstruction method requires a computational time lower than 3 seconds and strongly depends on the underlying landmark estimation algorithm. Despite the good geometrical results, the width of the outcome is usually overstretched, and the generated 3D face model does not include the forehead. Furthermore, the authors did not quantitatively evaluate the contribution of their method to recognition capability or its potential admissibility in forensic scenarios.
Similarly to some of the previous studies, Rahman et al. [
165] highlighted how 3D face models could enhance forensic recognition from CCTV camera footage. In particular, they reconstructed the 3D face models from single frames by optimizing an
Active Appearance Model (AAM), an algorithm that matches a statistical model of shape and appearance to an image [
115]. Therefore, they evaluated the improvement in the recognition capability of different ML models with respect to 2D AAMs. However, this study on the possible application of 3DFR to forensic recognition from surveillance videos is limited to a dataset of a few subjects, which is not publicly available. Finally, the authors did not assess the recognition performance and did not investigate its admissibility in terms of understandability and forensic evaluation.
With a similar purpose, Van Dam et al. [
204] proposed a method based on a projective reconstruction of facial landmarks. An auto-calibration step is added to obtain the 3D face model from CCTV camera footage. The authors considered the specific case of fraud at an
Automated Teller Machine (ATM) with an uncalibrated camera, where faces are acquired at a very short distance with a distorted perspective [
201]. They analyzed how the quality of the resulting 3D face model is affected by the number of frames and the number of landmarks, assessing the minimum values for a precise perspective shape reconstruction, which could, however, be affected by possible errors in the estimated landmark coordinates introduced by noise. However, the authors did not quantitatively assess the method’s improvement with respect to its 2D counterpart in face recognition. Neither understandability nor forensic evaluation was addressed.
In 2016, the same authors proposed another method to reconstruct a 3D face from multiple frame images for an application in the forensic context [
206]. Such a method employs a photometric algorithm to estimate both the texture and the 3D shape of the face. The goal is to avoid generating an outcome biased towards any facial model, thus enhancing the suitability in a forensic comparison process. The proposed method is a coarse-to-fine shape estimation process: It first provides a coarse 3D shape [
205] and other pose parameters from landmarks in multiple frames, and then a refined shape is computed by assessing the photometric parameters for every point in the 3D model. The last step also allows estimating the texture information, thus providing the dense 3D model. The authors evaluated their method in a recognition task on an in-house dataset of single-camera video recordings of 48 people containing frames with different facial views. The reconstructed textures were compared with the ground truth images through FaceVACS [
79] while increasing the number of considered frames across iterations, revealing enhancement in recognition results in most cases. Furthermore, using the likelihood ratio framework, they highlighted that in more than 60% of the cases, data initially unsuitable for forensic cases became meaningful in the same context through the proposed method. As the authors suggested, the outcomes can be used to generate faces under different poses, while they are not suitable for shape-based 3D face recognition. Despite the enhanced suitability in forensic scenarios, one of the most significant drawbacks is that the model-free reconstruction approach is computationally more burdensome than a model-based one and requires multiple images. Furthermore, the authors did not quantitatively evaluate their method on publicly available datasets. Although the authors did not assess understandability, they introduced a forensic evaluation of their method based on 3DFR; thus, in our opinion, this is the most significant work on 3DFR applied to forensics.
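To make the role of the likelihood ratio framework more concrete, the following sketch converts a comparison score into a likelihood ratio, that is, the ratio between the probability density of the score under the same-source hypothesis and under the different-source hypothesis. Modeling both score distributions as Gaussians fitted on calibration scores is an assumption made here for brevity, and the function names are hypothetical; the calibration procedure used in the reviewed work may differ.

```python
from statistics import NormalDist


def fit_lr_model(same_source_scores, different_source_scores):
    """Fit one Gaussian per hypothesis on calibration scores
    (assumed score model; each list needs at least two scores)."""
    same = NormalDist.from_samples(same_source_scores)
    diff = NormalDist.from_samples(different_source_scores)
    return same, diff


def likelihood_ratio(score, same, diff):
    """LR > 1 supports the same-source hypothesis, LR < 1 the
    different-source one; values close to 1 carry little evidential weight."""
    return same.pdf(score) / diff.pdf(score)
```

Under this reading, data "becoming meaningful" corresponds to likelihood ratios that move away from 1 in the direction of the true hypothesis after the reconstruction step.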
Unlike all previous approaches, Loohuis [
135] proposed to employ 3DFR to address the lack of facial images that could be used in training ML and DL models for face recognition tasks, for example, in a surveillance scenario. The author combined a method for generating face images with rendering techniques to simulate such adverse conditions and assessed the impact of the resulting synthetic images on existing face recognition systems. In particular, the method proposed by Deng et al. [
53] for reconstructing the 3D model of the face, based on a DL model [
96] and a 3DMM [
158], has been applied to the single images of a subset of the ForenFace dataset [
227] to generate images simulating different levels of image degradation. Unfortunately, the proposed method does not perform well on very low-quality images. However, a reasonable level of degradation in many forensic scenarios can still be mimicked because the generated images show a high degree of similarity with the reference ones. Moreover, a similar approach employing 3DFR for generating degraded synthetic views has already been demonstrated to enhance the recognition performance of automatic face recognition systems from low-quality videos, such as those acquired by surveillance cameras, with holistic, local, and DL approaches [
102]. Furthermore, despite the human subjectivity in perceiving the quality of an image, such an approach could even be employed in the development of quality assessment algorithms for facial images, since it would allow comparing the degraded image against a known reference version thereof, thus aiding the selection of potentially suitable samples either for the reconstruction or the recognition tasks [
181].
6 Discussions
In this article, we reviewed the state-of-the-art of
3D face reconstruction (3DFR) from 2D images and videos for forensic recognition, evaluating the proposed approaches with respect to the requirements of a potential forensics-related system. Furthermore, the proposed approaches for enhancing forensic recognition in terms of performance were analyzed together with their potential application scenarios (Figure
6).
The previously described studies mainly focus on enhancing the performance of recognition tasks in different contexts, such as the identification or verification of suspects within a gallery of mugshot images or the search for missing persons. They revealed the potential advantages of the fusion of the reconstructed model and the original images, which would allow taking advantage of the characteristics of a 3D facial model while limiting the possible loss of information in the reconstruction [
132]. So far, researchers have proposed employing 3DFR either on the reference data or the probe material by re-projecting the 3D model into 2D images to aid a 2D-to-2D recognition. In particular, the first approach could find application in adapting the pose of the model to the face in the probe material, both to ease the visual comparison for investigative purposes and to compare the so-obtained image with the probe face through an automatic system, as preliminarily proposed for forensic scenarios by Dutta et al. [
59] and then further investigated by Zeng et al. [
228,
229]. Similarly, this projection of reference 3D faces in the 2D domain in various poses was shown to improve the recognition performance of such systems, especially concerning their robustness to pose variation, when introduced as an augmentation step for training feature-based [
93,
230] and DL systems [
131,
132]. Moreover, Loohuis [
135] suggested that 3DFR could be successfully employed for mimicking the degraded quality of the probe data when coupled with rendering techniques for simulating such adverse conditions. The 3DFR from probe material finds applications in many scenarios as well, easing the comparison from a single probe image by rendering the face to match the pose with a reference image [
93,
165] or by reconstructing it from multiple frames of a surveillance video [
204,
206].
Despite the promising results, especially concerning the robustness to pose variations in various probe and reference data types, most of the previously described studies did not evaluate their methods considering other requirements of an automated system supporting forensic analysis (Figure
5) related to understandability and forensic evaluation [
27,
110], as summarized in Figure
4. Moreover, the proposed methods do not assess their robustness to some typical issues of forensic cases, such as the presence of occlusions [
94,
114], making them inherently unsuitable for recognition scenarios involving them (Section
2.2). However, some of them implicitly used a face recognition algorithm based on local descriptors [
93,
228,
229,
230], which supports the understandability of the output [
190,
222]. Furthermore, a single study [
206] employed a framework for easing forensic evaluation.
Although most of the proposed methods aim to enhance face recognition performance, they are not comparable quantitatively due to the variability in the considered settings. One of the most relevant differences is related to the involved datasets, which differ in acquisition environment, size, and availability (Section
5). The differences in terms of data type and quality represent another factor that makes them suitable for different tasks. Thus, it is necessary to address and compare these datasets separately (Section
2.2) in terms of the recognition approach (Figure
12) and application scenarios (Sections
3 and
4). Of course, some differences are due to the time of publication, but the recent withdrawal of some datasets, due to stricter privacy rules on biometric data in recent years, has made things more complex. For example, the
General Data Protection Regulation (GDPR) rules in the European Union strongly differ from those of other countries [
116,
235]. In particular, future studies should be based on datasets suitable for forensic research. A model example, in our opinion, is the ForenFace dataset [
227], because it takes realistic circumstances into account and also provides a set of anthropomorphic features proposed by the FISWG [
83]. Furthermore, they should evaluate the face reconstruction accuracy on large-scale 3D face datasets, such as FIDENTIS [
203]. Some forensic use cases are not yet included in any benchmark dataset; for example, the special case of CCTV-based recognition from images recorded at ATMs with a very short distance from the subject and a distorted perspective [
108].
For both reconstruction and recognition tasks, a demographic analysis should be conducted on the performance to assess the bias against some demographic groups, an undesired issue in forensics that is sometimes overlooked even in current research [
110]. To this aim, explicit demographic information about subjects represented in the datasets could aid in facing such an issue [
180]. However, this useful data may be difficult to assemble and recover due to the privacy rules mentioned above. Moreover, the source of this bias could be related to the imbalance of the underlying data. This issue could be mitigated by employing synthetic datasets, like the FAIR benchmark [
66]. However, the employment of synthetic data still requires more investigation to be fully validated [
45] and then accepted in the forensic context.
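The demographic analysis suggested above can be sketched as a per-group breakdown of recognition errors. The following is a minimal illustration only, not a protocol taken from any of the cited works: the function name, the fixed similarity threshold, and the group labels are all assumptions introduced here, and it presumes that comparison scores and demographic annotations are available.

```python
from collections import defaultdict

def per_group_error_rates(scores, is_genuine, groups, threshold):
    """Return {group: (FNMR, FMR)} at a fixed similarity threshold.

    scores     : similarity scores, one per comparison (higher = more similar)
    is_genuine : True if the comparison involves the same identity
    groups     : demographic label attached to each comparison (hypothetical)
    """
    counts = defaultdict(lambda: {"gen": 0, "fn": 0, "imp": 0, "fm": 0})
    for score, genuine, group in zip(scores, is_genuine, groups):
        c = counts[group]
        if genuine:
            c["gen"] += 1
            if score < threshold:   # genuine pair rejected: false non-match
                c["fn"] += 1
        else:
            c["imp"] += 1
            if score >= threshold:  # impostor pair accepted: false match
                c["fm"] += 1
    rates = {}
    for group, c in counts.items():
        fnmr = c["fn"] / c["gen"] if c["gen"] else float("nan")
        fmr = c["fm"] / c["imp"] if c["imp"] else float("nan")
        rates[group] = (fnmr, fmr)
    return rates
```

Large gaps between groups in either rate at the same threshold would point to the kind of bias discussed in this paragraph, although what counts as an acceptable gap remains an application-specific decision.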
Any underlying 3D reference model could be affected by bias problems as well [
110], which may affect the face recognition system, making it unsuitable in forensic cases [
206]. Therefore, a
model-free reconstruction approach should be employed whenever possible. An example of this reconstruction approach is stereophotogrammetry, which allows capturing craniofacial morphology in high quality [
97] to a level of detail that is often less important in generic recognition applications but which becomes crucial in the forensic context. Although it may not be suitable for 3D-to-3D recognition scenarios, especially those based on shape comparison, such a reconstruction approach could be exploited to generate synthetic views for comparison with the reference material [
206] and, therefore, employed in a 2D-to-3D scenario. However, a drawback is the requirement of multiple images of the suspects [
206], which cannot be acquired in every forensic case. Another disadvantage of a model-free reconstruction approach is the significantly higher computational time required, which makes it impractical for real-time applications. Nonetheless, this represents a minor issue for many forensic applications, such as the ones related to lawsuits.
Thus, when a photometric reconstruction approach is unsuitable, a choice between approaches based on 3DMM and DL must be made [
148], even those not strictly proposed for forensic applications, taking into account their suitability, advantages, and drawbacks. For example, the methods based on 3DMM allow generating an arbitrary number of facial expressions, while those based on DL provide high-quality face texture synthesis [
77,
148]. Therefore, the morphological model could be employed either for adapting the expression of the reference model or for imposing a neutral expression on the normalized face in the probe image, while a detailed reconstruction could be obtained through a DL network [
77]. However, it must be pointed out that heavy manipulation, such as expression modification, may not be allowed in most evaluation cases, although it remains a valid aid for investigation purposes. These approaches have technical limits, namely, the focus on global characteristics rather than fine details of the morphological model and the requirement of a large number of 3D scans for the training of DL networks [
148]. The lack of understandability is another issue for the DL approach as well [
27]. However, combining two or more reconstruction approaches could help limit some of the drawbacks of each single approach. For example, previous studies highlighted that it is possible to reconstruct highly detailed 3D faces even from a single image by using the prior knowledge of the global facial shape encoded in the 3DMM and refining it through a photometric approach [
37,
130,
172]. Similarly, the combination of a morphological model with one or more DL networks has been proposed as well [
65]. State-of-the-art methods not explicitly proposed for forensic applications should be further investigated in terms of potential and suitability as well, especially those based on DL, which have proven promising in addressing some of the typical issues in forensics such as occlusion removal (e.g., References [
147,
183]), 3DFR from one or multiple in-the-wild images (e.g., References [
133,
231]), and face frontalization (e.g., Reference [
223]), thus potentially representing an aid in many investigative scenarios.
The computational time represents one of the main reasons why automated systems should be employed in forensics. It is an important feature in some specific applications, such as real-time identification through surveillance cameras (e.g., Reference [
71]). In this regard, the online computational time must be assessed, that is, the time required to test a single probe image and, thus, to recognize the captured individual. Specifically, it depends on both the recognition algorithm and any strategy that must be applied to the probe to enhance the recognition task (e.g., some “canonical” representation). In these terms, a reasonable computational time for some applications related to surveillance and lawsuits was reported by Zhang et al. [
230] and Zeng et al. [
228] (Table
1). Zhang et al. [
230] and Liang et al. [
131,
132] also evaluated the offline computational time, that is, the time required for applying the proposed enhancement approach based on 3D face reconstruction (e.g., gallery enlargement) and for training the recognition system. In particular, the reported values suggest a notable improvement with respect to earlier proposals. However, although these are the most time-consuming processes, the offline computational time is generally of less concern, since it does not impact real-time operations.
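The online computational time described above, namely the time needed to process a single probe, can be measured with a simple timing harness. The sketch below is only illustrative: `recognize` is a hypothetical callable standing in for whatever probe-processing pipeline is being evaluated, and the warm-up step is an assumption added to keep one-off initialization costs out of the measurement.

```python
import statistics
import time

def online_time_per_probe(recognize, probes, warmup=1):
    """Return (mean, worst-case) wall-clock seconds per probe.

    recognize : hypothetical callable processing one probe end to end
    probes    : list of probe inputs
    warmup    : number of untimed calls made before measuring
    """
    for probe in probes[:warmup]:
        recognize(probe)  # untimed warm-up (caches, lazy initialization)
    timings = []
    for probe in probes:
        start = time.perf_counter()
        recognize(probe)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), max(timings)
```

Reporting the worst case alongside the mean is a design choice made here because real-time surveillance settings are usually constrained by the slowest probe rather than by the average one.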
It should be remarked that the most important feature in forensics is generally the reconstruction accuracy [
15,
68,
195], since it represents a requirement that is often stricter than in generic recognition tasks. In the literature, 3D model quality is evaluated through shape errors, estimated as the distance between the model and the corresponding ground truth. However, the quality of the extracted texture should be assessed as well due to its role in the recognition task [
12,
16,
97]. For example, the texture could allow exploiting facial marks, such as scars and tattoos. Their exploitation would enhance both performance and understandability in forensic comparison [
15,
83,
157,
190,
191]. Furthermore, these facial marks are becoming even more valuable, thanks to the availability of higher resolution sensors and the growing size of face image databases and their capability to improve speed and performance of recognition systems [
111]. Hence, future research should take into account these additional features to assess their permanence in the generated 3D models. This also holds for other morphological features, which forensic examiners evaluate to justify the outcome of the facial comparison (e.g., the decision whether the suspect is likely to be the one represented in a probe image) [
9]. In addition to holistic features (e.g., the overall shape), examiners consider local characteristics related to the proportions and the position of facial features, such as the relative size of the ears with respect to the eyes, nose, and mouth [
83]. The asymmetry between facial components should also be considered [
83], thanks to its higher physical stability over time than other features. For example, the overall shape of the face could change because of weight gain [
86]; however, the asymmetry between facial components is less affected. Therefore, these features could be an effective aid for forensic examiners, even to justify their conclusions on the comparison in courts of law.
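The shape-based evaluation mentioned in this paragraph (the distance between the reconstructed model and its ground truth) can be sketched as a mean nearest-neighbor error between two point sets. This is a simplified illustration under stated assumptions, not the metric of any specific cited work: it presumes that the two models are already rigidly aligned and expressed in the same units, and it uses point-to-point rather than point-to-surface distances. The function name is hypothetical.

```python
import math

def mean_nearest_neighbor_error(reconstructed, ground_truth):
    """Mean distance from each reconstructed vertex to its closest
    ground-truth vertex (brute force, point-to-point).

    Both arguments are sequences of (x, y, z) tuples, assumed to be
    already aligned in a common coordinate frame.
    """
    total = 0.0
    for p in reconstructed:
        # distance to the closest ground-truth point
        total += min(math.dist(p, q) for q in ground_truth)
    return total / len(reconstructed)
```

For dense scans, the brute-force search would normally be replaced by a spatial index, and a texture-quality measure would have to be added separately, since this sketch covers the shape only.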
To sum up, we expect that great attention will be paid to improving the recognition capability in forensic scenarios through 3DFR. The extremely unfavorable conditions typically encountered in criminal cases could become more tractable by appropriately modelling both shape and texture. To this end, data representative of forensic trace and reference material are necessary, also considering the robustness to other common factors altering the appearance, such as facial hair and the presence of occlusions. Bias toward any demographic group should be avoided in the datasets, favoring the system’s fairness. In our opinion, the understandability of the proposed algorithms should go hand in hand with data availability. Data and algorithms will play a central role in effectively integrating 3D face reconstruction from 2D images and videos in the forensic field. Similarly, the employment of frameworks for easing forensic evaluation by non-expert professionals should become common practice to support the admissibility of the proposed methods in real cases. To this end, an interdisciplinary approach involving computer science and law experts would speed up this process. Therefore, we believe that the adoption of 3DFR in real-world forensic applications is not far off and that this survey contributes a step toward this scenario.