3.2 Fractal Encoding
After the preprocessing step described in Section
3.1, the Fractal Encoding module is applied. First, from the detected landmarks, the module creates a mask of the depth image. The landmarks are treated as the inner points of a polygon, and the mask is built as the convex polygon containing them. This step is essential for fractal encoding since, as we will describe in detail, fractal encoding is based on the concept of self-similarity. In fact, self-similarity between blocks is enhanced by the strong contrast between the black and grayscale parts of the masked images. Self-similarity is a typical property of fractals, derived from the definition of a self-similar object that Peitgen gave in [
1] as follows:
If parts of a figure are small replicas of the whole, then the figure is called self-similar. A figure is strictly self-similar if it can be decomposed into parts that are exact replicas of the whole. Any arbitrary part contains an exact replica of the whole figure.
Using the well-known fractal coding scheme, originally devised as a lossy image compression algorithm, it is possible to explore the self-similarity properties of two depth frames representing pose variations with a similar head rotation. In particular, by determining the forms’ redundancy at different scaling factors, it is possible to obtain an estimate as close to the source frame as possible. Hence, face position variations can be efficiently described in terms of their intrinsic auto-similarities. It is important to underline that the fractal coding approach achieved high pose estimation accuracy in the RGB domain, rivaling the performance of the best machine learning-based approaches [
7].
Fractal compression was first proposed by Jacquin in 1989 [
21]. He described a method called the Partitioned Iterated Function System (PIFS), with the following steps:
(1) Partition the original image into a set of blocks of size \(n\times n\), named range blocks and indicated with \(R_i\).
(2) For each range block \(R_i\), search for a similar block \(D_i\) among the blocks obtained by partitioning the original image into \(2n\times 2n\) blocks. The set of blocks obtained in this way is called the set of domain blocks.
(3) Find a set of functions \(H = \lbrace f_1, \ldots , f_n\rbrace\) such that each \(f_i\) transforms domain block \(D_i\) into range block \(R_i\).
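The three steps above can be sketched as follows. This is a minimal brute-force illustration; the function name `pifs_encode`, the non-overlapping domain grid, and the gray-level-only affine fit are our own simplifications, and a real encoder would also search rotated and flipped domain blocks:

```python
# Minimal PIFS encoding sketch (assumption: square grayscale image as a
# NumPy array; only contrast/brightness are fitted, no isometries searched).
import numpy as np

def pifs_encode(img, n):
    """For each n-by-n range block, find the best 2n-by-2n domain block and
    the gray-level map i -> s*i + o that best approximates it."""
    N = img.shape[0]
    code = []
    for ry in range(0, N, n):
        for rx in range(0, N, n):
            R = img[ry:ry+n, rx:rx+n].astype(float)
            best = None
            for dy in range(0, N - 2*n + 1, 2*n):
                for dx in range(0, N - 2*n + 1, 2*n):
                    D = img[dy:dy+2*n, dx:dx+2*n].astype(float)
                    # Contract the domain block to n-by-n by 2x2 averaging.
                    Dc = D.reshape(n, 2, n, 2).mean(axis=(1, 3))
                    # Least-squares contrast s and brightness o.
                    x, y = Dc.ravel(), R.ravel()
                    vx = x.var()
                    s = ((x - x.mean()) * (y - y.mean())).mean() / vx if vx > 0 else 0.0
                    o = y.mean() - s * x.mean()
                    err = float(np.sum((s * Dc + o - R) ** 2))
                    if best is None or err < best[0]:
                        best = (err, dy, dx, s, o)
            code.append(best[1:])   # (domain y, domain x, contrast, brightness)
    return code

code = pifs_encode(np.arange(64).reshape(8, 8), 2)
print(len(code))  # -> 16 range blocks for an 8x8 image with n = 2
```

For frontal fractal encoding, discussed below, the same search would simply run against a reference frontal image instead of the input image itself.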
The functions mapping domain blocks to range blocks cannot be ordinary ones. In fact, to guarantee self-similarity, the features must be rescaled along the x and y axes by affine transformations. Let \(x\) be an array; an affine transformation is defined as
\(f(x) = Ax + b,\)
where \(A\) is a matrix, often representing a rotation, and \(b\) is the translation array. In particular, in the case of a grayscale image, \(A\) is called the transformation matrix and \(b\) the offset vector. The array \(x\) is, in this case, composed of the \((x,y)\) coordinates and the gray level \(z\) of the image.
In addition, since the functions in which we are interested must transform blocks of dimension \(2n\times 2n\) into blocks of dimension \(n\times n\), we have to look for contractive functions, also called contraction mappings, defined as follows:
\(d(f(x), f(y)) \le k \cdot d(x,y).\)
In other words, for each pair \((x,y)\) of points in the grayscale image, after the transformation their distance \(d\) is scaled by a factor \(k\), where \(k\) is a fixed value between 0 and 1.
Finally, to ensure that iterating those transformations yields a closer and closer representation of the original image, we observe that the contractive transformations satisfy the hypotheses of the fixed point theorem, stated as follows:
Fixed Point Theorem
In a complete metric space \((M,d)\), if \(f:M \rightarrow M\) is a contractive transformation with parameter \(k\), then there exists a unique fixed point \(x_i \in M\) such that
\(f(x_i) = x_i,\)
and for any point \(x\) in \(M\) it also holds that
\(\lim _{n \rightarrow \infty } f^{n}(x) = x_i.\)
For this reason, the iterations of PIFS are also called fixed point iterations.
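A minimal numeric illustration of this convergence, using a toy one-dimensional contraction (the function and the starting point are arbitrary choices of ours):

```python
# Fixed point iteration demo: f(x) = 0.5*x + 1 is a contraction with k = 0.5,
# so iterating it from any start converges to its unique fixed point x* = 2,
# just as repeated PIFS transformations converge to the encoded image.
def f(x):
    return 0.5 * x + 1.0

x = 40.0                    # arbitrary starting point
for _ in range(60):
    x = f(x)
print(round(x, 6))          # -> 2.0
```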
The fractal encoding described above produces a matrix representing the transformation to be performed on each domain block to obtain the corresponding range block. In fractal compression, this transformation is used to build the compressed image. In our case, we are not aiming at compression, and we use the obtained matrix as the feature to perform HPE. We also demonstrate, with experiments introduced later, that for some particular sets of data it is more effective to proceed with a modified version of the fractal encoding that we call frontal fractal encoding. In this version, instead of using the same image in the fractal module to build the domain and range blocks, we use a random frontal face as a reference.
In this case, we can observe that, instead of proper self-similarity, we are actually searching for a frontal-similarity: the particular transformations able to encode a generic image as a frontal one. Notably, to make this method subject-invariant, we use a random frontal image, which can even come from a very different subject (e.g., different facial traits, gender, or age).
Having described how to obtain the encodings, we now clarify how to obtain the head pose estimation from them. We use a portion of the datasets to build a reference model, obtained from a set of head pose images with a wide range of variations. Those images are encoded with the method above, and an encoded matrix is thus obtained from each image. Each matrix is then converted into an array by concatenating its rows. The motivation for this step is that, once we have an array for each image, each image can be used as a row of a matrix that we call the template. This template has a number of rows equal to the number of images in the set and a number of columns that depends on the dimensions of the images taken into consideration. Each transformation block contains six elements: two for the coordinates of the range block, one representing the inversion, one the rotation, one the contrast, and one the brightness.
If we define \(N\times N\) as the dimension of the original image and \(n\times n\) as the dimension of the range blocks, the number of rows of the original encoding matrix will be \(N/n\). As a consequence, the number of columns in the template will be \((N/n)\times 6\). To each row of the template, the head pose ground truth in pitch, yaw, and roll is associated and stored as a separate file. Once this reference template is built, an incoming input image is encoded following the same scheme, and its fractal encoding matrix is transformed into an array. This array is then compared with the template mentioned above. The comparison of the input array with the arrays in the template is performed following the distances and protocols introduced in Section 3.4.
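As a sketch, the template construction can be expressed as follows (the helper name `build_template` and the toy dimensions are illustrative):

```python
# Template construction sketch (assumption: `encodings` is a list of
# fractal-encoding matrices, one per reference image, each of shape
# (N/n, 6) as described in the text).
import numpy as np

def build_template(encodings):
    """Flatten each encoding matrix row-wise into one template row."""
    return np.stack([np.asarray(e).ravel() for e in encodings])

# A reference set of 5 images, each encoded into a (N/n) x 6 matrix, N/n = 4:
encodings = [np.zeros((4, 6)) for _ in range(5)]
template = build_template(encodings)
print(template.shape)  # -> (5, 24), i.e., one row per image, (N/n)*6 columns
```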
3.3 Keypoint Intensity
Utilizing the intensity of keypoints in the depth image for head pose estimation provides numerous benefits, particularly in real-time driving scenarios. First, keypoint intensity permits a more accurate and efficient estimation of the orientation and position of the head in 3D space. In a driving scenario, the lighting conditions can vary significantly, which can affect the algorithm’s precision. By utilizing the intensity of keypoints in the depth image, the algorithm becomes less sensitive to these variations and can produce more accurate results. Second, the intensity of keypoints can be used to track specific head features, such as the tip of the nose, the center of the eyes, or the corners of the mouth. This information is especially useful for estimating head pose because it enables the algorithm to determine the head’s position and orientation with greater precision. In addition, the use of intensity-based keypoints can reduce the computational complexity of the system in comparison to the use of full depth maps. This reduction in complexity is especially advantageous for real-time applications, such as driving scenarios, in which the head pose estimation algorithm must process data quickly and effectively. In our system, after the landmarks are detected on the depth image, the corresponding levels of intensity are extracted from their coordinates. Each of the 68 landmarks detected in the preprocessing step is thus associated with its depth gray level.
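The extraction step amounts to reading the depth image at the landmark coordinates; a minimal sketch, assuming `depth` is a 2-D array and `landmarks` holds the (x, y) pixel coordinates produced by the detector:

```python
# Keypoint-intensity extraction sketch (assumption: landmarks are given as
# (x, y) pixel coordinates, so the array is indexed as depth[y, x]).
import numpy as np

def keypoint_intensities(depth, landmarks):
    """Read the depth gray level at each landmark coordinate."""
    return np.array([depth[y, x] for (x, y) in landmarks])

depth = np.arange(25).reshape(5, 5)   # toy 5x5 depth image
landmarks = [(0, 0), (2, 1), (4, 3)]  # three (x, y) pairs instead of 68
print(keypoint_intensities(depth, landmarks))  # -> [ 0  7 19]
```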
As previously described in Section
3.2, the template is built using a subset of the datasets as a reference. Here, in particular, for each image of the reference, the preprocessing step is performed, and the gray levels of the 68 landmarks are saved in an array. The template is then composed of
N rows, representing the
N images in the reference, and 68 columns, representing the 68 gray levels of the depth image. Each row is associated with the head pose that it represents, stored in a separate file.
Using only the keypoints instead of the complete depth image can be both an advantage and a disadvantage. From a computational point of view, we have only 68 values to compare for each reference. On the other hand, if some of those 68 values are unreliable, we lose significant information. Some of the 68 values may fall too far from the camera to perform the comparisons when the subject is beyond the range of the depth sensor. In this case, due to the limited range of the depth camera, some of the keypoints fall into a dark zone where there is no depth information other than the background, defined as black. This is the case for some datasets built using older depth technologies such as Kinect 1, with a limited range and an automatic background segmentation that removes each part of the subject lying at a distance greater than a fixed value (in meters). This effect will be seen for the Biwi dataset in the experiments. To avoid this problem, after the landmark extraction, if more than three keypoint intensities are equal to 0, i.e., because depth information is missing, we exclude those sets of keypoints from our evaluations. It is important to underline that this happens only when data is captured with older devices. With newer devices, such as Kinect 2, all the keypoint intensities pass this test, and no information or frames are missing.
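This validity check can be sketched in a few lines (the helper name `frame_is_valid` and the short toy arrays are illustrative; the real arrays hold 68 values):

```python
# Frame validity sketch: discard a frame when more than three of its
# keypoint intensities are 0, i.e., fall in the sensor's dark zone.
import numpy as np

def frame_is_valid(intensities, max_zeros=3):
    return int(np.count_nonzero(intensities == 0)) <= max_zeros

print(frame_is_valid(np.array([0, 0, 0, 10, 20])))  # -> True  (3 zeros)
print(frame_is_valid(np.array([0, 0, 0, 0, 20])))   # -> False (4 zeros)
```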
In addition, another problem related to low-resource sensors is the depth range at which the sensor is able to collect data. In another dataset, ICT-3DHP, which we will discuss in the experiments, we noticed that the information inside the face, i.e., the gray levels, appeared very flat. In fact, once we calculated its entropy using the classical Shannon entropy formula, we obtained a mean value of 1.55 over the dataset. If we consider that the mean value of Pandora, which we also experimented with, is 3.55, it is clear that there is a huge difference between the datasets in terms of information. To alleviate this phenomenon, when the amount of information carried by the image is too low, we use not only the keypoint intensities but also their spatial coordinates. In fact, we compute the 68 landmarks as 3-D points in space, \(K(x,y,i)\), where \(x,y\) are the spatial coordinates and \(i\) is the intensity of the gray level at that point. We then substitute the previous keypoint intensity with the following:
\(h_j = \sqrt{(x_j - x_{33})^2 + (y_j - y_{33})^2 + (i_j - i_{33})^2},\)
where \((x_{33},y_{33},i_{33})\) represents the nose, that is, the 33rd landmark. This makes \(h_j\) the 3-D distance of the \(j\)th keypoint from the nose. The new array is then created following this rule, as is the reference template. To decide whether the spatial information is necessary, as introduced, we use the entropy level. In particular, we empirically set the entropy threshold to 2, based on the observations made on the datasets above. This threshold separates the images for which spatial information is essential from those for which these features become detrimental.
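The entropy gate and the nose-relative distance feature can be sketched together as follows (assumptions of ours: the image is an 8-bit array, landmark index 33 in 0-based indexing is the nose tip, and the helper names are illustrative):

```python
# Entropy-gated keypoint features: intensities when the image carries enough
# information, 3-D distances to the nose landmark otherwise (threshold 2).
import numpy as np

def shannon_entropy(img):
    """Classical Shannon entropy of an 8-bit gray-level histogram, in bits."""
    hist = np.bincount(img.ravel().astype(np.int64), minlength=256).astype(float)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def keypoint_features(points, img, threshold=2.0):
    """points: array of shape (68, 3) with rows (x, y, intensity)."""
    points = np.asarray(points, float)
    if shannon_entropy(img) >= threshold:
        return points[:, 2]                       # intensities only
    nose = points[33]                             # (x_33, y_33, i_33)
    return np.linalg.norm(points - nose, axis=1)  # 3-D distance to the nose

flat = np.zeros((8, 8), dtype=np.uint8)           # entropy 0: very flat image
pts = np.random.default_rng(0).random((68, 3))
h = keypoint_features(pts, flat)                  # falls back to distances
print(h[33])  # -> 0.0, the nose's distance to itself
```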
After the template is built with one or the other technique, the input image follows the same path, and its array is then compared with the ones stored in the template. To do this, as in the case of the fractal encoding, we tested various distances, introduced in the next section.
3.5 HYDE-F Fusion
As described in the previous sections, the fractal encoding method and the keypoint intensity method are each able to perform HPE individually. However, we want to combine the responses of those two techniques to improve the final results. Since we compare the responses of each method, the fusion technique we use belongs to the family of score-level fusion techniques. Compared to other kinds of fusion, this is the most suitable when two different techniques are involved, as in our case. In particular, the keypoint intensity yields a feature array of length 68, while for the fractal encoding the length depends on the parameters we set and is always different from 68. For this reason, feature-level fusion is not suitable. Score-level fusion is very popular in biometric techniques [
34].
Here we propose a fusion technique based on optimization. In particular, as we will discuss in greater detail in the experiments section, we chose 20% of each dataset as a control set. For this subset, we can perform in parallel the fractal encoding HPE and the keypoint intensity HPE. In particular, we denote by
\(KI=(p_{ki},y_{ki},r_{ki})\) and
\(FE=(p_{fe},y_{fe},r_{fe})\) the predictions of the keypoint intensity and fractal encoding methods, respectively, where
\(p_*, y_*, r_*\) represent the pitch, yaw and roll prediction arrays of each method. For the control set, we have the ground truth available, which we call
\(GT=(p_{gt},y_{gt},r_{gt})\) . Our aim is to find a method to obtain a function
f such that
\(f(KI,FE)=GT\). In particular, since the fractal encoding and the keypoint intensity have different behaviors along the three axes, we will split the problem into three parts. We are thus interested in finding
\(f_p, f_y\) , and
\(f_r\) such that
\(f_p(p_{ki},p_{fe})=p_{gt}\) ,
\(f_y(y_{ki},y_{fe})=y_{gt}\) and
\(f_r(r_{ki},r_{fe})=r_{gt}\). Since each method is able to obtain reasonable results along the three axes even when considered alone, we can assume the problem to be a simple one, and we rewrite it as a linear combination:
\(f_p(p_{ki},p_{fe}) = a_p\, p_{ki} + b_p\, p_{fe}, \quad f_y(y_{ki},y_{fe}) = a_y\, y_{ki} + b_y\, y_{fe}, \quad f_r(r_{ki},r_{fe}) = a_r\, r_{ki} + b_r\, r_{fe}.\)
Ideally, we want to find the tuple
\((a_p,a_y,a_r,b_p,b_y,b_r)\) that gives us the exact ground truth for every image. However, the control set, depending on the dataset, contains between \(2,\!500\) and \(10,\!000\) frames, which leads to as many equations. For this reason, the problem cannot be solved as a system of equations, because it has no exact solution. We will instead solve it as three optimization problems, minimizing the differences between the ground truth and the predicted values. The resulting problems are formulated as
\(\min _{a_p,b_p} \Vert a_p\, p_{ki} + b_p\, p_{fe} - p_{gt}\Vert , \quad \min _{a_y,b_y} \Vert a_y\, y_{ki} + b_y\, y_{fe} - y_{gt}\Vert , \quad \min _{a_r,b_r} \Vert a_r\, r_{ki} + b_r\, r_{fe} - r_{gt}\Vert .\)
For our particular kind of problem, before choosing the optimization method to be applied, we can make some observations. First of all, for each image, we will obtain a different prediction. This means that
\(p_*, y_*, r_*\) are arrays. Their length depends on the number of samples in the control set, which we can define as
n. To obtain a single function to be minimized, we can assume that our objective is to minimize the sum of the errors committed by the methods over all the data in the control set. For this reason, the functions to be minimized take the following form:
\(F_p(a_p,b_p) = \sum _{i=1}^{n} \vert a_p\, p_{ki,i} + b_p\, p_{fe,i} - p_{gt,i}\vert ,\)
and analogously for \(F_y(a_y,b_y)\) and \(F_r(a_r,b_r)\).
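For intuition, the squared-error variant of this per-axis fit admits a closed-form solution, which the sketch below uses in place of the Nelder-Mead optimizer applied in this work (the function name and the toy control set are illustrative):

```python
# Per-axis fusion fit sketch: find (a, b) minimising the summed squared error
# of a*KI + b*FE against the ground truth, via ordinary least squares.
import numpy as np

def fit_axis(pred_ki, pred_fe, gt):
    """Least-squares (a, b) for one axis over the control set."""
    A = np.column_stack([pred_ki, pred_fe])
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return a, b

# If GT happens to be exactly 0.5*KI + 0.5*FE, the fit recovers (0.5, 0.5):
ki = np.array([10.0, -5.0, 3.0, 8.0])   # toy pitch predictions, method 1
fe = np.array([12.0, -4.0, 2.0, 6.0])   # toy pitch predictions, method 2
gt = 0.5 * ki + 0.5 * fe
a, b = fit_axis(ki, fe, gt)
print(round(a, 3), round(b, 3))  # -> 0.5 0.5
```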
This observation significantly reduces the computation, allowing us to handle this problem as a linear, two-dimensional one. In addition, we have no constraints, because the angular values can also be negative, and our only aim is to minimize the errors. We did not add any constraints forcing the
\(a_*\) and
\(b_*\) values to be non-zero, because we also want to investigate whether there are data on which the contribution of one of the methods is always useless. As can be seen in Section
4, this does not happen, so we can confirm that both methods contribute positively to head pose estimation.
Once the problem is well defined, we have to choose an optimization algorithm. Since our problem is very simple, i.e., linear, unconstrained, and two-dimensional, we choose the Nelder-Mead Method (NMM) [16], a well-known derivative-free optimization technique, which we introduce below. It employs the concept of a simplex, a polytope with \(n+1\) vertices, where \(n\) is the number of variables, which in our case is two. The method uses four parameters: the coefficient of reflection \(\rho > 0\), the expansion \(\chi > 1\), the contraction \(0 < \gamma < 1\), and the shrinkage \(0 < \theta < 1\). Those parameters are set, as usual, to \(\rho = 1\), \(\chi = 2\), \(\gamma = 1/2\), and \(\theta = 1/2\). At each iteration of the NMM, the centroid of the polytope is computed as
\(\bar{v} = \frac{1}{n}\sum _{i=1}^{n} v_i,\)
where \(v_i\) are the vertices of the polytope. In particular, they are ordered in such a way that the worst vertex, the one that gives the maximum value in a minimization problem, is the last one, \(v_{n+1}\), and is excluded from the centroid calculation. Then a reflected point is created with the following operation:
\(x_r = \bar{v} + \rho (\bar{v} - v_{n+1}).\)
Its function value is then calculated as \(f_r = f(x_r)\), where \(f_i\) denotes the value of the function to be minimized at vertex \(i\). At this point, there are several cases that lead to different steps. If \(f_r\) is lower than all the previous values, it means that we are moving in the right direction, so we try to decrease the function even further using an operation called expansion,
\(x_e = \bar{v} + \chi (x_r - \bar{v}),\)
and we proceed with this new value. If
\(f_r\) lies between the best and the second-worst values, we reject the worst vertex \(v_{n+1}\), and \(x_r\) becomes the new \(v_{n+1}\). Finally, \(f_r\) could fall between the \(f_n\) and \(f_{n+1}\) values; in this case, we operate with a function called contraction,
\(x_c = \bar{v} + \gamma (x_r - \bar{v}),\)
to try to find a lower value. In all other cases, we apply the shrinkage step, defined as
\(v_i = v_1 + \theta (v_i - v_1), \quad i = 2, \ldots , n+1,\)
to try, also in this case, to find lower values. In each of these cases, when a value lower than the worst value \(f_{n+1}\) is found, the associated vertex is added to the simplex, and the worst is deleted.
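The iteration just described can be sketched as follows for the two-variable case, with the standard coefficients given above. This is a simplified illustration with our own acceptance logic; production code should rely on a vetted implementation such as SciPy's:

```python
# Minimal Nelder-Mead sketch with rho=1, chi=2, gamma=1/2, theta=1/2
# (assumption: simplified reflection/expansion/contraction/shrink logic).
import numpy as np

def nelder_mead(f, simplex, iters=200, rho=1.0, chi=2.0, gamma=0.5, theta=0.5):
    v = [np.asarray(p, float) for p in simplex]     # n+1 = 3 vertices for n=2
    for _ in range(iters):
        v.sort(key=f)                               # best first, worst last
        centroid = np.mean(v[:-1], axis=0)          # excludes worst vertex
        xr = centroid + rho * (centroid - v[-1])    # reflection
        if f(xr) < f(v[0]):
            xe = centroid + chi * (xr - centroid)   # expansion
            v[-1] = xe if f(xe) < f(xr) else xr
        elif f(xr) < f(v[-2]):
            v[-1] = xr                              # accept reflected point
        else:
            xc = centroid + gamma * (xr - centroid) # contraction
            if f(xc) < f(v[-1]):
                v[-1] = xc
            else:                                   # shrink toward the best
                v = [v[0]] + [v[0] + theta * (p - v[0]) for p in v[1:]]
    v.sort(key=f)
    return v[0]

# Minimise a simple quadratic with optimum at (1, 2):
best = nelder_mead(lambda p: (p[0] - 1)**2 + (p[1] - 2)**2,
                   [(0, 0), (2, 0), (0, 2)])
print(np.round(best, 3))
```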
A classical problem in optimization methods is the choice of the initial point, that is, the first point \((a,b)\) from which the optimization starts. As investigated in [30], in the NMM the choice of the initial point/simplex is crucial. In particular, the authors highlight that the orientation of the initial simplex has a significant effect on efficiency, and that an automatic predictor cannot be used in this case to provide sufficient accuracy. Since no known initial simplex significantly improves outcomes, the authors suggest trying different approaches. The randomized bounds method, proposed in [10], is one of the possible methods.
This method is particularly useful when the search parameters lie in a limited range. In this case, the authors suggest using pseudo-random numbers, where \(m_j\) and \(M_j\) are defined as the minimum and maximum values of the variables. The value of the first variable, \(a_0\), is randomly fixed in its interval, while the second variable, \(b_0\) in our case, is defined as
\(b_0 = m_b + \theta _b (M_b - m_b),\)
where \(\theta _b\) is a random number between 0 and 1. In our case, from the control set, considering the differences between the ground truth and the predicted values of each method, we obtained \(m_a=0, M_a=2\) and \(m_b=0, M_b=2\). Then, we proceed as suggested by the pseudo-randomized method to search for the initial points. In particular, we noticed that fixing one of the variables to values similar to those of the other variable yields similar performance results. For this reason, instead of proceeding with a completely random choice of the first variable, we proceed with a binary search over the first variable in its interval. The entire process, from the initial point selection to the optimization results, takes 0.000598 seconds. This time should be multiplied by the maximum number of iterations to be performed; in our case, each variable ranges from 0 to 2 with a granularity of 0.01, yielding 200 possible values per variable and a maximum number of iterations of
\(200\log (200) \approx 1059.66\) for the aforementioned method. The total time required to obtain the optimal values of \(a\) and \(b\) over the control set is 0.63 seconds. This is more than reasonable if we consider that this step needs to be performed only if the source of the data, i.e., the sensors, changes. In the following sections, we present the numeric values obtained by applying those methods to Pandora, as well as their generalization to Biwi and ICT-3DHP.
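The randomized-bounds start can be sketched as follows, using the bounds \(m=0\), \(M=2\) given above (the helper name `initial_point` and the fixed seed are illustrative):

```python
# Randomized-bounds initial point sketch: a0 uniform in [m_a, M_a],
# b0 = m_b + theta_b * (M_b - m_b) with theta_b drawn in [0, 1).
import random

def initial_point(m_a=0.0, M_a=2.0, m_b=0.0, M_b=2.0, seed=0):
    rng = random.Random(seed)
    a0 = rng.uniform(m_a, M_a)          # first variable fixed in its interval
    theta_b = rng.random()              # random number in [0, 1)
    b0 = m_b + theta_b * (M_b - m_b)
    return a0, b0

a0, b0 = initial_point()
print(0.0 <= a0 <= 2.0 and 0.0 <= b0 <= 2.0)  # -> True
```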
While the original formulation of the Nelder-Mead method does not always satisfy all of the convergence conditions (the simplex should remain uniformly nondegenerate, and each iteration must include some kind of “sufficient” descent criterion for the function values at the vertices), it has been proved by several researchers that the method converges for low-dimensional problems, as in our case [
24].