3.2 Fractal Encoding
After the preprocessing step described in Section
3.1, the Fractal Encoding module is applied. First, from the detected landmarks, the module creates a mask of the depth image. The landmarks are treated as the inner points of a polygon, and the mask is built as the convex polygon containing them. This step is essential for fractal encoding since, as we will describe in detail, fractal encoding is based on the concept of self-similarity. In fact, self-similarity between blocks is enhanced by the strong contrast between the black and grayscale parts of the masked images. Self-similarity is a typical property of fractals, derived from the definition of a self-similar object that Peitgen gave in [
1] as follows:
If parts of a figure are small replicas of the whole, then the figure is called self-similar. A figure is strictly self-similar if it can be decomposed into parts that are exact replicas of the whole. Any arbitrary part contains an exact replica of the whole figure.
Using the well-known fractal coding scheme, originally devised as a lossy image compression algorithm, it is possible to explore the self-similarity properties of two depth frames representing pose variations with a similar head rotation. In particular, by determining the forms’ redundancy at different scaling factors, it is possible to obtain an estimate as close to the source frame as possible. Hence, face position variations can be efficiently described in terms of their intrinsic auto-similarities. It is important to underline that the fractal coding approach achieved high pose estimation accuracy in the RGB domain, rivaling the performance of the best machine learning-based approaches [
7].
Fractal compression was first proposed by Jacquin in 1989 [
21]. He described a method called the Partitioned Iterated Function System (PIFS), with the following steps:
(1) Partition the original image into a set of blocks of size \(n\times n\), named range blocks and indicated with \(R_i\).
(2) For each range block \(R_i\), search for a similar block \(D_i\) among the blocks obtained by partitioning the original image into \(2n\times 2n\) blocks. The set of blocks obtained in this way is called the set of domain blocks.
(3) Find a set of functions \(H = \lbrace f_1, \ldots , f_n\rbrace\) such that each \(f_i\) transforms domain block \(D_i\) into range block \(R_i\).
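The three steps above can be sketched as follows. This is a minimal brute-force illustration; the function name `pifs_encode`, the non-overlapping domain grid, and the gray-level-only affine fit are our own simplifications, and a real encoder would also search rotated and flipped domain blocks:

```python
# Minimal PIFS encoding sketch (assumption: square grayscale image as a
# NumPy array; only contrast/brightness are fitted, no isometries searched).
import numpy as np

def pifs_encode(img, n):
    """For each n-by-n range block, find the best 2n-by-2n domain block and
    the gray-level map i -> s*i + o that best approximates it."""
    N = img.shape[0]
    code = []
    for ry in range(0, N, n):
        for rx in range(0, N, n):
            R = img[ry:ry+n, rx:rx+n].astype(float)
            best = None
            for dy in range(0, N - 2*n + 1, 2*n):
                for dx in range(0, N - 2*n + 1, 2*n):
                    D = img[dy:dy+2*n, dx:dx+2*n].astype(float)
                    # Contract the domain block to n-by-n by 2x2 averaging.
                    Dc = D.reshape(n, 2, n, 2).mean(axis=(1, 3))
                    # Least-squares contrast s and brightness o.
                    x, y = Dc.ravel(), R.ravel()
                    vx = x.var()
                    s = ((x - x.mean()) * (y - y.mean())).mean() / vx if vx > 0 else 0.0
                    o = y.mean() - s * x.mean()
                    err = float(np.sum((s * Dc + o - R) ** 2))
                    if best is None or err < best[0]:
                        best = (err, dy, dx, s, o)
            code.append(best[1:])   # (domain y, domain x, contrast, brightness)
    return code

code = pifs_encode(np.arange(64).reshape(8, 8), 2)
print(len(code))  # -> 16 range blocks for an 8x8 image with n = 2
```

For frontal fractal encoding, discussed below, the same search would simply run against a reference frontal image instead of the input image itself.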
The functions mapping domain blocks to range blocks cannot be ordinary ones. In fact, to guarantee self-similarity, the features must be rescaled along the x and y axes by affine transformations. Let \(x\) be an array; an affine transformation is defined as
\(f(x) = Ax + b,\)
where \(A\) is a matrix, often representing a rotation, and \(b\) is the translation array. In particular, in the case of a grayscale image, \(A\) is called the transformation matrix and \(b\) the offset vector. The array \(x\) is, in this case, composed of the \((x,y)\) coordinates and the gray level \(z\) of the image.
In addition, since the functions in which we are interested must transform blocks of dimension \(2n\times 2n\) into blocks of dimension \(n\times n\), we have to look for contractive functions, also called contraction mappings, defined as follows:
\(d(f(x), f(y)) \le k \cdot d(x,y).\)
In other words, for each pair \((x,y)\) of points in the grayscale image, after the transformation their distance \(d\) is scaled by a factor \(k\), where \(k\) is a fixed value between 0 and 1.
Finally, to ensure that iterating those transformations yields a closer and closer representation of the original image, we observe that the contractive transformations satisfy the hypotheses of the fixed point theorem, stated as follows:
Fixed Point Theorem
In a complete metric space \((M,d)\), if \(f:M \rightarrow M\) is a contractive transformation with parameter \(k\), then there exists a unique fixed point \(x_i \in M\) such that
\(f(x_i) = x_i,\)
and for any point \(x\) in \(M\) it also holds that
\(\lim _{n \rightarrow \infty } f^{n}(x) = x_i.\)
For this reason, the iterations of PIFS are also called fixed point iterations.
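A minimal numeric illustration of this convergence, using a toy one-dimensional contraction (the function and the starting point are arbitrary choices of ours):

```python
# Fixed point iteration demo: f(x) = 0.5*x + 1 is a contraction with k = 0.5,
# so iterating it from any start converges to its unique fixed point x* = 2,
# just as repeated PIFS transformations converge to the encoded image.
def f(x):
    return 0.5 * x + 1.0

x = 40.0                    # arbitrary starting point
for _ in range(60):
    x = f(x)
print(round(x, 6))          # -> 2.0
```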
The fractal encoding described above produces a matrix representing the transformation to be performed on each domain block to obtain the corresponding range block. In fractal compression, this transformation is used to build the compressed image. In our case, we are not aiming at compression, and we use the obtained matrix as the feature to perform HPE. We also demonstrate, with experiments introduced later, that for some particular sets of data it is more effective to proceed with a modified version of the fractal encoding that we call frontal fractal encoding. In this version, instead of using the same image in the fractal module to build the domain and range blocks, we use a random frontal face as a reference.
In this case, we can observe that, instead of proper self-similarity, we are actually searching for a frontal-similarity: the particular transformations able to encode a generic image as a frontal one. Notably, to make this method subject-invariant, we use a random frontal image, which can even come from a very different subject (e.g., different facial traits, gender, or age).
Having described how to obtain the encodings, we now clarify how to obtain the head pose estimation from them. We use a portion of the datasets to build a reference model, obtained from a set of head pose images with a wide range of variations. Those images are encoded with the method above, and an encoded matrix is thus obtained from each image. Each matrix is then converted into an array by concatenating its rows. The motivation for this step is that, once we have an array for each image, each image can be used as a row of a matrix that we call the template. This template has a number of rows equal to the number of images in the set and a number of columns that depends on the dimensions of the images taken into consideration. Each transformation block contains six elements: two for the coordinates of the range block, one representing the inversion, one the rotation, one the contrast, and one the brightness.
If we define \(N\times N\) as the dimension of the original image and \(n\times n\) as the dimension of the range blocks, the number of rows of the original encoding matrix will be \(N/n\). As a consequence, the number of columns in the template will be \((N/n)\times 6\). To each row of the template, the head pose ground truth in pitch, yaw, and roll is associated and stored as a separate file. Once this reference template is built, an incoming input image is encoded following the same scheme, and its fractal encoding matrix is transformed into an array. This array is then compared with the template mentioned above. The comparison of the input array with the arrays in the template is performed following the distances and protocols introduced in Section 3.4.
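As a sketch, the template construction can be expressed as follows (the helper name `build_template` and the toy dimensions are illustrative):

```python
# Template construction sketch (assumption: `encodings` is a list of
# fractal-encoding matrices, one per reference image, each of shape
# (N/n, 6) as described in the text).
import numpy as np

def build_template(encodings):
    """Flatten each encoding matrix row-wise into one template row."""
    return np.stack([np.asarray(e).ravel() for e in encodings])

# A reference set of 5 images, each encoded into a (N/n) x 6 matrix, N/n = 4:
encodings = [np.zeros((4, 6)) for _ in range(5)]
template = build_template(encodings)
print(template.shape)  # -> (5, 24), i.e., one row per image, (N/n)*6 columns
```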
3.3 Keypoint Intensity
Utilizing the intensity of keypoints in the depth image for head pose estimation provides numerous benefits, particularly in real-time driving scenarios. First, keypoint intensity permits a more accurate and efficient estimation of the orientation and position of the head in 3D space. In a driving scenario, the lighting conditions can vary significantly, which can affect the algorithm’s precision. By utilizing the intensity of keypoints in the depth image, the algorithm becomes less sensitive to these variations and can produce more accurate results. Second, the intensity of keypoints can be used to track specific head features, such as the tip of the nose, the center of the eyes, or the corners of the mouth. This information is especially useful for estimating head pose because it enables the algorithm to determine the head’s position and orientation with greater precision. In addition, the use of intensity-based keypoints can reduce the computational complexity of the system in comparison to the use of full depth maps. This reduction in complexity is especially advantageous for real-time applications, such as driving scenarios, in which the head pose estimation algorithm must process data quickly and effectively. In our system, after the landmarks are detected on the depth image, the corresponding levels of intensity are extracted from their coordinates. Each of the 68 landmarks detected in the preprocessing step is thus associated with its depth gray level.
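The extraction step amounts to reading the depth image at the landmark coordinates; a minimal sketch, assuming `depth` is a 2-D array and `landmarks` holds the (x, y) pixel coordinates produced by the detector:

```python
# Keypoint-intensity extraction sketch (assumption: landmarks are given as
# (x, y) pixel coordinates, so the array is indexed as depth[y, x]).
import numpy as np

def keypoint_intensities(depth, landmarks):
    """Read the depth gray level at each landmark coordinate."""
    return np.array([depth[y, x] for (x, y) in landmarks])

depth = np.arange(25).reshape(5, 5)   # toy 5x5 depth image
landmarks = [(0, 0), (2, 1), (4, 3)]  # three (x, y) pairs instead of 68
print(keypoint_intensities(depth, landmarks))  # -> [ 0  7 19]
```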
As previously described in Section
3.2, the template is built using a subset of the datasets as a reference. Here, in particular, for each image of the reference, the preprocessing step is performed, and the gray levels of the 68 landmarks are saved in an array. The template is then composed of
N rows, representing the
N images in the reference, and 68 columns, representing the 68 gray levels of the depth image. Each row is associated with the head pose that it represents, stored in a separate file.
Using only the keypoints instead of the complete depth image can be both an advantage and a disadvantage. From a computational point of view, we have only 68 values to compare for each reference. On the other hand, if some of those 68 values are unreliable, we lose significant information. Some of the 68 values may fall too far from the camera to perform the comparisons when the subject is beyond the range of the depth sensor. In this case, due to the limited range of the depth camera, some of the keypoints fall into a dark zone where there is no depth information other than the background, defined as black. This is the case for some datasets built using older depth technologies such as Kinect 1, with a limited range and an automatic background segmentation that removes each part of the subject lying at a distance greater than a fixed value (in meters). This effect will be seen for the Biwi dataset in the experiments. To avoid this problem, after the landmark extraction, if more than three keypoint intensities are equal to 0, i.e., because depth information is missing, we exclude those sets of keypoints from our evaluations. It is important to underline that this happens only when data is captured with older devices. With newer devices, such as Kinect 2, all the keypoint intensities pass this test, and no information or frames are missing.
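This validity check can be sketched in a few lines (the helper name `frame_is_valid` and the short toy arrays are illustrative; the real arrays hold 68 values):

```python
# Frame validity sketch: discard a frame when more than three of its
# keypoint intensities are 0, i.e., fall in the sensor's dark zone.
import numpy as np

def frame_is_valid(intensities, max_zeros=3):
    return int(np.count_nonzero(intensities == 0)) <= max_zeros

print(frame_is_valid(np.array([0, 0, 0, 10, 20])))  # -> True  (3 zeros)
print(frame_is_valid(np.array([0, 0, 0, 0, 20])))   # -> False (4 zeros)
```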
In addition, another problem related to low-resource sensors is the depth range at which the sensor is able to collect data. In another dataset, ICT-3DHP, which we will discuss in the experiments, we noticed that the information inside the face, i.e., the gray levels, appeared very flat. In fact, once we calculated its entropy using the classical Shannon entropy formula, we obtained a mean value of 1.55 over the dataset. If we consider that the mean value of Pandora, which we also experimented with, is 3.55, it is clear that there is a huge difference between the datasets in terms of information. To alleviate this phenomenon, when the amount of information carried by the image is too low, we use not only the keypoint intensities but also their spatial coordinates. In fact, we compute the 68 landmarks as 3-D points in space, \(K(x,y,i)\), where \(x,y\) are the spatial coordinates and \(i\) is the intensity of the gray level at that point. We then substitute the previous keypoint intensity with the following:
\(h_j = \sqrt{(x_j - x_{33})^2 + (y_j - y_{33})^2 + (i_j - i_{33})^2},\)
where \((x_{33},y_{33},i_{33})\) represents the nose, that is, the 33rd landmark. This makes \(h_j\) the 3-D distance of the \(j\)th keypoint from the nose. The new array is then created following this rule, as is the reference template. To decide whether the spatial information is necessary, as introduced, we use the entropy level. In particular, we empirically set the entropy threshold to 2, based on the observations made on the datasets above. This threshold separates the images for which spatial information is essential from those for which these features become detrimental.
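The entropy gate and the nose-relative distance feature can be sketched together as follows (assumptions of ours: the image is an 8-bit array, landmark index 33 in 0-based indexing is the nose tip, and the helper names are illustrative):

```python
# Entropy-gated keypoint features: intensities when the image carries enough
# information, 3-D distances to the nose landmark otherwise (threshold 2).
import numpy as np

def shannon_entropy(img):
    """Classical Shannon entropy of an 8-bit gray-level histogram, in bits."""
    hist = np.bincount(img.ravel().astype(np.int64), minlength=256).astype(float)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def keypoint_features(points, img, threshold=2.0):
    """points: array of shape (68, 3) with rows (x, y, intensity)."""
    points = np.asarray(points, float)
    if shannon_entropy(img) >= threshold:
        return points[:, 2]                       # intensities only
    nose = points[33]                             # (x_33, y_33, i_33)
    return np.linalg.norm(points - nose, axis=1)  # 3-D distance to the nose

flat = np.zeros((8, 8), dtype=np.uint8)           # entropy 0: very flat image
pts = np.random.default_rng(0).random((68, 3))
h = keypoint_features(pts, flat)                  # falls back to distances
print(h[33])  # -> 0.0, the nose's distance to itself
```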
After the template is built with one or the other technique, the input image follows the same path, and its array is then compared with the ones stored in the template. To do this, as in the case of the fractal encoding, we tested various distances, introduced in the next section.
3.5 HYDE-F Fusion
As described in the previous sections, the fractal encoding method and the keypoint intensity method are each able to perform HPE individually. However, we want to combine the responses of those two techniques to improve the final results. Since we compare the responses of each method, the fusion technique we use belongs to the family of score-level fusion techniques. Compared to other kinds of fusion, this is the most suitable when two different techniques are involved, as in our case. In particular, the keypoint intensity yields a feature array of length 68, while for the fractal encoding the length depends on the parameters we set and is always different from 68. For this reason, feature-level fusion is not suitable. Score-level fusion is very popular in biometric techniques [
34].
Here we propose a fusion technique based on optimization. In particular, as we will discuss in greater detail in the experiments section, we chose 20% of each dataset as a control set. For this subset, we can perform in parallel the fractal encoding HPE and the keypoint intensity HPE. In particular, we denote by
\(KI=(p_{ki},y_{ki},r_{ki})\) and
\(FE=(p_{fe},y_{fe},r_{fe})\) the predictions of the keypoint intensity and fractal encoding methods, respectively, where
\(p_*, y_*, r_*\) represent the pitch, yaw and roll prediction arrays of each method. For the control set, we have the ground truth available, which we call
\(GT=(p_{gt},y_{gt},r_{gt})\) . Our aim is to find a method to obtain a function
f such that
\(f(KI,FE)=GT\). In particular, since the fractal encoding and the keypoint intensity have different behaviors along the three axes, we will split the problem into three parts. We are thus interested in finding
\(f_p, f_y\) , and
\(f_r\) such that
\(f_p(p_{ki},p_{fe})=p_{gt}\) ,
\(f_y(y_{ki},y_{fe})=y_{gt}\) and
\(f_r(r_{ki},r_{fe})=r_{gt}\). Since each method is able to obtain reasonable results along the three axes even when considered alone, we can assume the problem to be a simple one, and we rewrite it as a linear combination:
\(f_p(p_{ki},p_{fe}) = a_p\, p_{ki} + b_p\, p_{fe}, \quad f_y(y_{ki},y_{fe}) = a_y\, y_{ki} + b_y\, y_{fe}, \quad f_r(r_{ki},r_{fe}) = a_r\, r_{ki} + b_r\, r_{fe}.\)
Ideally, we want to find the tuple
\((a_p,a_y,a_r,b_p,b_y,b_r)\) that gives us the exact ground truth for every image. However, the control set, depending on the dataset, contains between \(2,\!500\) and \(10,\!000\) frames, which leads to as many equations. For this reason, the problem cannot be solved as a system of equations, because it has no exact solution. We will instead solve it as three optimization problems, minimizing the differences between the ground truth and the predicted values. The resulting problems are formulated as
\(\min _{a_p,b_p} \Vert a_p\, p_{ki} + b_p\, p_{fe} - p_{gt}\Vert , \quad \min _{a_y,b_y} \Vert a_y\, y_{ki} + b_y\, y_{fe} - y_{gt}\Vert , \quad \min _{a_r,b_r} \Vert a_r\, r_{ki} + b_r\, r_{fe} - r_{gt}\Vert .\)
For our particular kind of problem, before choosing the optimization method to be applied, we can make some observations. First of all, for each image, we will obtain a different prediction. This means that
\(p_*, y_*, r_*\) are arrays. Their length depends on the number of samples in the control set, which we can define as
n. To obtain a single function to be minimized, we can assume that our objective is to minimize the sum of the errors committed by the methods over all the data in the control set. For this reason, the functions to be minimized take the following form:
\(F_p(a_p,b_p) = \sum _{i=1}^{n} \vert a_p\, p_{ki,i} + b_p\, p_{fe,i} - p_{gt,i}\vert ,\)
and analogously for \(F_y(a_y,b_y)\) and \(F_r(a_r,b_r)\).
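For intuition, the squared-error variant of this per-axis fit admits a closed-form solution, which the sketch below uses in place of the Nelder-Mead optimizer applied in this work (the function name and the toy control set are illustrative):

```python
# Per-axis fusion fit sketch: find (a, b) minimising the summed squared error
# of a*KI + b*FE against the ground truth, via ordinary least squares.
import numpy as np

def fit_axis(pred_ki, pred_fe, gt):
    """Least-squares (a, b) for one axis over the control set."""
    A = np.column_stack([pred_ki, pred_fe])
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return a, b

# If GT happens to be exactly 0.5*KI + 0.5*FE, the fit recovers (0.5, 0.5):
ki = np.array([10.0, -5.0, 3.0, 8.0])   # toy pitch predictions, method 1
fe = np.array([12.0, -4.0, 2.0, 6.0])   # toy pitch predictions, method 2
gt = 0.5 * ki + 0.5 * fe
a, b = fit_axis(ki, fe, gt)
print(round(a, 3), round(b, 3))  # -> 0.5 0.5
```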
This observation significantly reduces the computation, allowing us to handle this problem as a linear, two-dimensional one. In addition, we have no constraints, because the angular values can also be negative, and our only aim is to minimize the errors. We did not add any constraints forcing the
\(a_*\) and
\(b_*\) values to be non-zero, because we also want to investigate whether there are data on which the contribution of one of the methods is always useless. As can be seen in Section
4, this does not happen, so we can confirm that both methods contribute positively to head pose estimation.
Once the problem is well defined, we have to choose an optimization algorithm. Since our problem is very simple, i.e., linear, unconstrained, and two-dimensional, we choose the Nelder-Mead Method (NMM) [16], a well-known derivative-free optimization technique, which we introduce below. It employs the concept of a simplex, a polytope with \(n+1\) vertices, where \(n\) is the number of variables, which in our case is two. The method uses four parameters: the coefficient of reflection \(\rho > 0\), the expansion \(\chi > 1\), the contraction \(0 < \gamma < 1\), and the shrinkage \(0 < \theta < 1\). Those parameters are set, as usual, to \(\rho = 1\), \(\chi = 2\), \(\gamma = 1/2\), and \(\theta = 1/2\). At each iteration of the NMM, the centroid of the polytope is computed as
\(\bar{v} = \frac{1}{n}\sum _{i=1}^{n} v_i,\)
where \(v_i\) are the vertices of the polytope. In particular, they are ordered in such a way that the worst vertex, the one that gives the maximum value in a minimization problem, is the last one, \(v_{n+1}\), and is excluded from the centroid calculation. Then a reflected point is created with the following operation:
\(x_r = \bar{v} + \rho (\bar{v} - v_{n+1}).\)
Its function value is then calculated as \(f_r = f(x_r)\), where \(f_i\) denotes the value of the function to be minimized at vertex \(i\). At this point, there are several cases that lead to different steps. If \(f_r\) is lower than all the previous values, it means that we are moving in the right direction, so we try to decrease the function even further using an operation called expansion,
\(x_e = \bar{v} + \chi (x_r - \bar{v}),\)
and we proceed with this new value. If
\(f_r\) lies between the best and the second-worst values, we reject the worst vertex \(v_{n+1}\), and \(x_r\) becomes the new \(v_{n+1}\). Finally, \(f_r\) could fall between the \(f_n\) and \(f_{n+1}\) values; in this case, we operate with a function called contraction,
\(x_c = \bar{v} + \gamma (x_r - \bar{v}),\)
to try to find a lower value. In all other cases, we apply the shrinkage step, defined as
\(v_i = v_1 + \theta (v_i - v_1), \quad i = 2, \ldots , n+1,\)
to try, also in this case, to find lower values. In each of these cases, when a value lower than the worst value \(f_{n+1}\) is found, the associated vertex is added to the simplex, and the worst is deleted.
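The iteration just described can be sketched as follows for the two-variable case, with the standard coefficients given above. This is a simplified illustration with our own acceptance logic; production code should rely on a vetted implementation such as SciPy's:

```python
# Minimal Nelder-Mead sketch with rho=1, chi=2, gamma=1/2, theta=1/2
# (assumption: simplified reflection/expansion/contraction/shrink logic).
import numpy as np

def nelder_mead(f, simplex, iters=200, rho=1.0, chi=2.0, gamma=0.5, theta=0.5):
    v = [np.asarray(p, float) for p in simplex]     # n+1 = 3 vertices for n=2
    for _ in range(iters):
        v.sort(key=f)                               # best first, worst last
        centroid = np.mean(v[:-1], axis=0)          # excludes worst vertex
        xr = centroid + rho * (centroid - v[-1])    # reflection
        if f(xr) < f(v[0]):
            xe = centroid + chi * (xr - centroid)   # expansion
            v[-1] = xe if f(xe) < f(xr) else xr
        elif f(xr) < f(v[-2]):
            v[-1] = xr                              # accept reflected point
        else:
            xc = centroid + gamma * (xr - centroid) # contraction
            if f(xc) < f(v[-1]):
                v[-1] = xc
            else:                                   # shrink toward the best
                v = [v[0]] + [v[0] + theta * (p - v[0]) for p in v[1:]]
    v.sort(key=f)
    return v[0]

# Minimise a simple quadratic with optimum at (1, 2):
best = nelder_mead(lambda p: (p[0] - 1)**2 + (p[1] - 2)**2,
                   [(0, 0), (2, 0), (0, 2)])
print(np.round(best, 3))
```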
A classical problem in optimization methods is the choice of the initial point, that is, the first point \((a,b)\) from which the optimization starts. As investigated in [30], in the NMM the choice of the initial point/simplex is crucial. In particular, the authors highlight that the orientation of the initial simplex has a significant effect on efficiency, and that an automatic predictor cannot be used in this case to provide sufficient accuracy. Since no known initial simplex significantly improves outcomes, the authors suggest trying different approaches. The randomized bounds method, proposed in [10], is one of the possible methods.
This method is particularly useful when the search parameters lie in a limited range. In this case, the authors suggest using pseudo-random numbers, where \(m_j\) and \(M_j\) are defined as the minimum and maximum values of the variables. The value of the first variable, \(a_0\), is randomly fixed in its interval, while the second variable, \(b_0\) in our case, is defined as
\(b_0 = m_b + \theta _b (M_b - m_b),\)
where \(\theta _b\) is a random number between 0 and 1. In our case, from the control set, considering the differences between the ground truth and the predicted values of each method, we obtained \(m_a=0, M_a=2\) and \(m_b=0, M_b=2\). Then, we proceed as suggested by the pseudo-randomized method to search for the initial points. In particular, we noticed that fixing one of the variables to values similar to those of the other variable yields similar performance results. For this reason, instead of proceeding with a completely random choice of the first variable, we proceed with a binary search over the first variable in its interval. The entire process, from the initial point selection to the optimization results, takes 0.000598 seconds. This time should be multiplied by the maximum number of iterations to be performed; in our case, each variable ranges from 0 to 2 with a granularity of 0.01, yielding 200 possible values per variable and a maximum number of iterations of
\(200\log (200) \approx 1059.66\) for the aforementioned method. The total time required to obtain the optimal values of \(a\) and \(b\) over the control set is 0.63 seconds. This is more than reasonable if we consider that this step needs to be performed only if the source of the data, i.e., the sensors, changes. In the following sections, we present the numeric values obtained by applying those methods to Pandora, as well as their generalization to Biwi and ICT-3DHP.
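The randomized-bounds start can be sketched as follows, using the bounds \(m=0\), \(M=2\) given above (the helper name `initial_point` and the fixed seed are illustrative):

```python
# Randomized-bounds initial point sketch: a0 uniform in [m_a, M_a],
# b0 = m_b + theta_b * (M_b - m_b) with theta_b drawn in [0, 1).
import random

def initial_point(m_a=0.0, M_a=2.0, m_b=0.0, M_b=2.0, seed=0):
    rng = random.Random(seed)
    a0 = rng.uniform(m_a, M_a)          # first variable fixed in its interval
    theta_b = rng.random()              # random number in [0, 1)
    b0 = m_b + theta_b * (M_b - m_b)
    return a0, b0

a0, b0 = initial_point()
print(0.0 <= a0 <= 2.0 and 0.0 <= b0 <= 2.0)  # -> True
```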
While the original formulation of the Nelder-Mead method does not always satisfy all of the convergence conditions (the simplex should remain uniformly nondegenerate, and each iteration must include some kind of “sufficient” descent criterion for the function values at the vertices), it has been proved by several researchers that the method converges for low-dimensional problems, as in our case [
24].