Beyond Basic Tuning: Exploring Discrepancies in User and Setup Calibration for Gaze Estimation
DOI: https://doi.org/10.1145/3649902.3653346
ETRA '24: 2024 Symposium on Eye Tracking Research and Applications, Glasgow, United Kingdom, June 2024
Calibrating gaze estimation models is crucial to maximize the effectiveness of these systems, although its implementation also poses usability challenges. Therefore, simplifying this process is key. In this work, we dissect the impact of calibration due to both the environment and the user in gaze estimation models that employ general-purpose devices. We aim to replicate a workflow close to the final application by starting with pre-trained models and subsequently calibrating them using different strategies, testing under various camera arrangements and user-specific variability. The results indicate a differentiation between the impact due to the user and that due to the setup, with the user-related components having a slightly more pronounced impact than those related to the setup, opening the door to understanding calibration as a composite process. In any case, the development of calibration-free remote gaze estimation solutions remains a great challenge, given the crucial role of calibration.
ACM Reference Format:
Gonzalo Garde, José María Armendariz, Ruben Beruete Cerezo, Rafael Cabeza, and Arantxa Villanueva. 2024. Beyond Basic Tuning: Exploring Discrepancies in User and Setup Calibration for Gaze Estimation. In 2024 Symposium on Eye Tracking Research and Applications (ETRA '24), June 04--07, 2024, Glasgow, United Kingdom. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3649902.3653346
1 INTRODUCTION
Gaze is vital in interpersonal communication, transcending linguistic and cultural barriers. Interpreting gaze direction enriches understanding and facilitates the transmission of emotions, intentions, and cognitive states [Abele 1986; Kleinke 1986]. This holds significance in diverse fields from psychology to artificial intelligence, including human-computer interaction (HCI) [Hessels 2020; Ladouce et al. 2017]. For widespread technology adoption, success hinges not only on technical merits but also on user experience. Comfort and accessibility are crucial, especially beyond specialized environments [Godoi et al. 2020; Li and Liu 2021].
Accurate gaze estimation can be achieved under certain conditions: specialized optical elements, controlled illumination (including infrared sources), and precise image focusing on the ocular region. However, challenges in guaranteeing these conditions in uncontrolled environments and the cost of specialized hardware limit these solutions [Hansen and Ji 2010].
In this context, there is a need to improve existing solutions for what we will refer to as general-camera environments, which make use of off-the-shelf devices such as webcams, camcorders, or smartphone cameras.
Machine learning has revolutionized computer vision, with neural networks becoming the de facto standard [Bulat et al. 2020; Dosovitskiy et al. 2021; Karras et al. 2020]. Gaze estimation, facilitated by neural networks, democratizes technology use in everyday scenarios [Ansari et al. 2021; Bulat et al. 2020; Dosovitskiy et al. 2021; Guo et al. 2019; He et al. 2019; Karras et al. 2020; Yu and Odobez 2020]. Some of these solutions make use of off-the-shelf devices such as webcams, camcorders, or smartphone cameras, seeking to bring the technology closer to the end user and day-to-day life [Ansari et al. 2021; Guo et al. 2019; He et al. 2019; Yu and Odobez 2020].
Regardless of the nature of the model, calibration enhances its performance. This procedure improves accuracy by addressing two aspects: user calibration and device/environmental calibration (Equation 1). User calibration customizes the system to accommodate individual visual traits like kappa angle, ocular morphology, accessories (e.g., glasses), etc. In contrast, environmental calibration addresses changes in the physical conditions of the testing system, including alterations in lighting, camera position relative to the scene (e.g., screen), typical working distance, and more.
$$\text{Calibration} = \text{Calibration}_{\text{user}} + \text{Calibration}_{\text{setup}} \tag{1}$$
Although the importance of calibration is unquestionable in high-performance (infrared) eye tracking, it remains unexplored in general-camera scenarios. Given the paradigm shift of orienting solutions towards this reality, we consider it necessary to conduct specific studies on the calibration process and its impact in this framework. Existing works that make use of calibration in neural network solutions focus on how to perform it, but not on its specific analysis [Chen and Shi 2020; Gudi et al. 2021; Linden et al. 2019; Liu et al. 2019; Park et al. 2019; Yu and Odobez 2020].
In this work we focus on distinguishing, for general-camera solutions, the role of personalizing the system to the user from that of adapting it to the system's physical configuration. We make the following hypotheses.
- H1: Given the dual nature of gaze estimation system calibration, in a general-camera solution, isolating one of the calibration components (user or setup calibration) and calibrating solely on it still has a significantly positive impact on the system's performance, albeit not at the same level as a classic calibration.
- H2: The impact of each calibration component, i.e. setup calibration and user calibration, is differentiable from one another.
- H3: Calibration is not generalizable across users and setups.
2 WORKING FRAMEWORK
In this section we will detail the elements and methodologies adopted during the development of this work.
2.1 Databases for pretraining
2.1.1 U2Eyes synthetic database. We use the U2Eyes synthetic database, specified in [Porta et al. 2019], as the training dataset for a synthetic pre-trained base model. Synthetic databases offer the advantage of providing a high volume of data with a wide range of capture conditions (lighting, poses, capture points, etc.), with the guarantee that the data is properly labeled. The downside is that, being synthetic, the resulting images are not completely visually realistic, even with the improvements achieved in recent years in rendering tools.
The U2Eyes database [Porta et al. 2019] is composed of 300 different users. Each user has their own configuration of head shape, textures, and unique visual parameters, such as the horizontal and vertical values of kappa angles for each eye [Garde et al. 2021]. The "capture" distances in this database range from 45 to 65 centimeters.
2.1.2 EVE real database. We use the EVE database [Park et al. 2020] as the pre-training dataset for a realistic base model. It is a multicamera and multiuser database, where different visual stimuli, in the form of videos, are presented to each user while gaze tracking is performed with a specialized tracker. The EVE database consists of 54 users, captured at a working distance of approximately 60 centimeters; small deviations may exist, although the initial distance between the setup and the seat was set at 60 centimeters.
Real-world databases offer advantages due to their alignment with the application environment, even in controlled settings. However, this proximity introduces uncertainties in extracted features, influenced by intricate user responses and marker reliability. While providing an authentic context, real-world databases demand careful consideration of the inherent unpredictability in human responses during data analysis.
2.2 Database for calibration
The models constructed using the pre-training databases will be tested using a dataset built within this work. The setup used 6 cameras distributed around a screen, all on the same plane and thus at the same depth with respect to the user, i.e., approximately 60 centimeters. Figure 1 shows a representation of the setup and the distribution of the cameras.
Each recording involves displaying two videos of a 4x4 calibration grid while the cameras record the scene. Each point appears in random order without replacement. Two grids and their corresponding point sequences are assigned to each session, so different users undergoing the same session are presented with the same sequences.
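To make this presentation protocol concrete, the snippet below is a minimal, hypothetical sketch of how such a grid and its randomized (without replacement) presentation order could be generated; the 1920x1080 resolution comes from Section 2.3, while the `make_calibration_sequence` helper, the margin, and the seed handling are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch (not the authors' code) of a 4x4 calibration grid with a fixed
# random presentation order (sampling without replacement) per session.
import random

def make_calibration_sequence(screen_w=1920, screen_h=1080, n=4, margin=0.05, seed=0):
    xs = [screen_w * (margin + (1 - 2 * margin) * i / (n - 1)) for i in range(n)]
    ys = [screen_h * (margin + (1 - 2 * margin) * j / (n - 1)) for j in range(n)]
    points = [(x, y) for y in ys for x in xs]      # the 16 grid points
    random.Random(seed).shuffle(points)            # random order, no repeats
    return points                                  # same seed -> same sequence per session
```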
To help with the synchronization of the cameras, a LED visible from all the cameras activates when the user clicks the mouse. During video capture, the user is instructed to click the mouse when looking at each point, so that the LED lights up momentarily.
A total of 25 users participated in the experiments: 14 men and 11 women, aged between 20 and 50 years. Each user performed 10 sessions, and each session consisted of 2 calibration videos, giving a total of 500 videos recorded by each of the 6 cameras. For each calibration point, 3 frames were extracted per video. Figure 2 shows a representation of some of the captured images and users.
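As a quick sanity check on the dataset size, the figures above combine as follows; the per-frame total is our own derived arithmetic, assuming the 16 points of the 4x4 grid per video.

```python
# Derived arithmetic from the numbers reported above (16 points per video is an
# inference from the 4x4 grid; all other values are stated in the text).
users, sessions, videos_per_session, cameras = 25, 10, 2, 6
points_per_video, frames_per_point = 4 * 4, 3

videos_per_camera = users * sessions * videos_per_session   # 500 videos per camera
total_videos = videos_per_camera * cameras                   # 3000 videos overall
frames_per_video = points_per_video * frames_per_point       # 48 frames per video
total_frames = total_videos * frames_per_video               # 144000 frames overall
print(videos_per_camera, total_videos, frames_per_video, total_frames)
```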
2.2.1 Privacy and ethics regarding personal data. The databases used in eye-tracking contain sensitive information, as visual interaction is a personal and unique aspect of each individual. Therefore, special attention must be given to the storage of these databases and the models trained with them. Participants in the study were informed about potential risks and, in accordance with data protection legislation, provided their consent. Sharing the database will be considered upon prior contact with the authors, who are the custodians of the data.
2.3 Model architecture
We propose a workflow aligned with the development of a real application: a base model, trained extensively on the pre-training databases described in Section 2.1, would then be calibrated by the user to adjust it to their specific conditions.
Due to neural networks’ opaque behavior, particularly with abstract inputs [Fisher et al. 2019; Simonyan et al. 2014], we opt for features instead of images for better control. Our focus is on measuring the impact of calibration, not finding the perfect model or feature set. Simplicity aids in understanding how calibration tweaks influence performance metrics.
Two feature-based models are used, pre-trained with features from the U2Eyes synthetic database [Porta et al. 2019] and the EVE real database [Park et al. 2020]. This accounts for potential biases between synthetic and realistic models. Figure 3 illustrates the features, automatically extracted using a purpose-trained model [Larumbe-Bergera et al. 2021].
The employed model is based on fully connected layers, with the following architecture (a minimal sketch is given after the list):
- An input of 24 features. It consists of a vector of the pixel coordinates (x, y) of the 12 features extracted from the images (Figure 3). The values are normalized relative to the size of the images in each dimension (1920x1080).
- Intermediate stages following the pattern: N-neuron layer, N-neuron layer, dropout (0.2), with the values of N being [512, 128, 32, 16].
- An output of 2 values corresponding to the (x, y) components of the Look-At-Point (LAP). The values are normalized between -1 and 1 based on the LAPs present in the training dataset (within a maximum range of 0-1920 horizontally and 0-1080 vertically).
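The following is a minimal sketch of this architecture, assuming a TensorFlow/Keras implementation; the activation functions (ReLU in hidden layers, tanh at the output to match the [-1, 1] normalization) and the placement of the L2 regularization value reported in Section 3 on the dense kernels are assumptions, since the paper does not state them explicitly.

```python
# Minimal sketch of the fully connected gaze model described above
# (assumed Keras implementation; activations and regularizer placement are assumptions).
from tensorflow.keras import layers, models, regularizers

def build_gaze_model(n_features=24, widths=(512, 128, 32, 16), dropout=0.2, l2=0.0005):
    reg = regularizers.l2(l2)
    net = [layers.Input(shape=(n_features,))]          # 12 (x, y) landmark coordinates
    for n in widths:
        # Each stage: two N-neuron dense layers followed by dropout(0.2).
        net += [layers.Dense(n, activation="relu", kernel_regularizer=reg),
                layers.Dense(n, activation="relu", kernel_regularizer=reg),
                layers.Dropout(dropout)]
    # Two outputs: the (x, y) Look-At-Point, normalized to [-1, 1].
    net.append(layers.Dense(2, activation="tanh"))
    return models.Sequential(net)
```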
3 EXPERIMENTS
The design of the experiments is inspired by usage scenarios that we believe may occur during the real-world application of general-camera solutions. We outline the model's pre-training, considering two scenarios: synthetic database pre-training (Section 2.1.1) and real-world database pre-training (Section 2.1.2). This results in the two different baseline models upon which various calibrations will later be performed. Three different experiments are devised: classic calibration, user calibration and setup calibration. Each of the experiments is repeated for each of the two baseline models and, within these, for each user and each camera.
The same base procedure is followed in the case of the three proposed experiments: we start from a pre-trained base model; then the pre-trained base model is calibrated (fine-tuned with calibration data) and, finally, that calibrated model is tested. The calibration and testing sets are defined by the conditions of each experiment, which will determine which combination of users and setups are used in each case. The calibration and test stages are carried out using users belonging to our dataset as explained in section 2.2.
The same fitting conditions are maintained for the three experiments: we use a batch size of 32 inputs, the Adam optimizer, MSE as training and validation loss, and an initial learning rate of 0.01, and we fit the model in two phases of 100 epochs each, applying early stopping if there is no improvement in the validation loss for 15 epochs. In the second phase, we reduce the learning rate to 0.001. To prevent overfitting, we employ ridge regression (L2 regularization) with a value of 0.0005.
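The snippet below is an illustrative sketch of this two-phase fine-tuning procedure, under the same Keras assumption as the model sketch in Section 2.3; restoring the best weights after early stopping is our own assumption, not a detail reported in the paper.

```python
# Illustrative two-phase calibration (fine-tuning) loop matching the reported settings;
# assumes the Keras model sketched in Section 2.3.
from tensorflow.keras import optimizers, callbacks

def calibrate(model, x_cal, y_cal, x_val, y_val):
    for lr in (0.01, 0.001):                      # phase 1, then phase 2
        early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=15,
                                             restore_best_weights=True)
        model.compile(optimizer=optimizers.Adam(learning_rate=lr), loss="mse")
        model.fit(x_cal, y_cal, validation_data=(x_val, y_val),
                  batch_size=32, epochs=100, callbacks=[early_stop], verbose=0)
    return model
```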
In addition to the verbal descriptions of the experiments, visual representations of all three can be found in Appendix A, offering supplementary graphical insight into the procedures and outcomes discussed.
3.1 Classic calibration
We perform a classic calibration in which a pre-trained model is adjusted to the conditions of the user and target setup. This base model is calibrated using features of a single user captured on a single setup configuration. After calibration, the calibrated model is tested over images of the same user captured from the same setup configuration but not used in the calibration phase.
3.2 User calibration
Calibration is performed on features of a unique user captured in different setup configurations, i.e. by using different cameras. The test then uses features of the same user captured in a setup configuration not present among the calibration ones. The aim is to limit the calibration to the adaptation to the individual, studying the impact of the change in the setup.
3.3 Setup calibration
Calibration is performed on features from multiple users captured on the same physical layout, i.e. a single camera. System testing is performed on features captured by the same camera from a user outside the calibration group. The aim is to isolate the degree to which the calibration is able to adapt to the system conditions, independently of the user.
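To make the three experimental conditions concrete, the sketch below illustrates how the calibration and test splits differ across the classic, user, and setup experiments; the record structure (`user` and `camera` keys) and the 80/20 split used in the classic case are illustrative assumptions, not the authors' exact protocol.

```python
# Illustrative split logic for the three experiments; `samples` is assumed to be a
# list of dicts with at least 'user' and 'camera' keys (plus features and labels).
def split_classic(samples, user, camera, cal_fraction=0.8):
    pool = [s for s in samples if s["user"] == user and s["camera"] == camera]
    k = int(len(pool) * cal_fraction)
    return pool[:k], pool[k:]                 # calibrate and test: same user, same camera

def split_user_calibration(samples, user, test_camera):
    cal = [s for s in samples if s["user"] == user and s["camera"] != test_camera]
    test = [s for s in samples if s["user"] == user and s["camera"] == test_camera]
    return cal, test                          # same user, held-out camera

def split_setup_calibration(samples, camera, test_user):
    cal = [s for s in samples if s["camera"] == camera and s["user"] != test_user]
    test = [s for s in samples if s["camera"] == camera and s["user"] == test_user]
    return cal, test                          # same camera, held-out user
```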
4 RESULTS
Table 1: Angular error (in degrees) of the baseline models before any calibration.

| Baseline model | Median | Mean | SD |
|---|---|---|---|
| EVE | 18.38 | 18.88 | 8.72 |
| U2EYES | 33.08 | 30.53 | 12.09 |
Table 2: Angular error (in degrees) for the three calibration experiments, aggregated over all test cases.

| Experiment | Baseline model | Median | Mean | SD |
|---|---|---|---|---|
| Classic calibration | EVE | 4.86 | 6.62 | 5.99 |
| Classic calibration | U2EYES | 6.30 | 8.00 | 6.17 |
| Setup calibration | EVE | 8.71 | 10.08 | 6.75 |
| Setup calibration | U2EYES | 8.48 | 9.92 | 6.79 |
| User calibration | EVE | 11.12 | 12.39 | 7.43 |
| User calibration | U2EYES | 10.79 | 12.12 | 7.79 |
Table 3: Angular error (in degrees) with results grouped by test user.

| Experiment | Baseline model | Median | Mean | SD |
|---|---|---|---|---|
| Classic calibration | EVE | 4.78 | 5.12 | 1.59 |
| Classic calibration | U2EYES | 5.95 | 6.71 | 2.37 |
| Setup calibration | EVE | 8.27 | 8.86 | 1.67 |
| Setup calibration | U2EYES | 8.38 | 8.71 | 2.10 |
| User calibration | EVE | 10.42 | 11.34 | 2.30 |
| User calibration | U2EYES | 10.66 | 11.01 | 2.93 |
Table 4: Angular error (in degrees) with results grouped by test camera.

| Experiment | Baseline model | Median | Mean | SD |
|---|---|---|---|---|
| Classic calibration | EVE | 4.81 | 4.84 | 0.25 |
| Classic calibration | U2EYES | 6.29 | 6.31 | 0.56 |
| Setup calibration | EVE | 8.67 | 8.72 | 0.91 |
| Setup calibration | U2EYES | 8.45 | 8.52 | 0.75 |
| User calibration | EVE | 11.13 | 11.28 | 2.16 |
| User calibration | U2EYES | 11.13 | 10.92 | 1.69 |
The difference in degrees between the estimated and the actual LAP will be used as the figure of merit. The corresponding angular error is calculated following the formula below:
$$\theta_{\text{error}} = \arctan\left(\frac{\lVert \mathbf{LAP}_{\text{estimated}} - \mathbf{LAP}_{\text{real}} \rVert}{d}\right) \tag{2}$$

where $d$ is the distance between the user and the screen (approximately 60 centimeters).
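As a concrete illustration, a sketch of this figure of merit is given below; it assumes the error is derived from the on-screen offset between estimated and true LAP and the nominal user-to-screen distance of about 60 cm, with the pixel pitch left as an assumed parameter since the paper does not detail the pixel-to-centimeter conversion.

```python
# Sketch of the angular-error figure of merit under the stated assumptions.
import math

def angular_error_deg(lap_est_px, lap_true_px, px_size_cm, distance_cm=60.0):
    dx = (lap_est_px[0] - lap_true_px[0]) * px_size_cm   # horizontal offset in cm
    dy = (lap_est_px[1] - lap_true_px[1]) * px_size_cm   # vertical offset in cm
    offset_cm = math.hypot(dx, dy)                       # on-screen miss distance
    return math.degrees(math.atan2(offset_cm, distance_cm))
```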
Table 1 presents the angular error before any calibration, serving as a reference for the results obtained from the experiments. Tables 2, 3 and 4 summarize the results obtained for the three experiments, each addressing a different analysis. Table 2 aggregates all test cases, without distinguishing between user and/or camera. Table 3 groups the results of each experiment by test user, while Table 4 groups them by test camera. Similarly, Figure 4 encompasses the different analyses. Subfigure (a) represents a comprehensive analysis considering all tests conducted. Subfigure (b) performs a similar analysis, but grouping results based on the test user, utilizing information from Table 3. Finally, subfigure (c) displays the grouping based on the test camera, corresponding to Table 4.
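For clarity, the three summaries could be computed along the following lines, assuming a table of per-test angular errors with `experiment`, `baseline`, `user`, `camera` and `error_deg` columns; the column names, the use of pandas, and averaging per group before aggregating are illustrative assumptions rather than the authors' exact analysis pipeline.

```python
# Illustrative aggregation of per-test angular errors into Table 2/3/4-style summaries.
import pandas as pd

def summarize(df: pd.DataFrame):
    keys = ["experiment", "baseline"]
    overall = df.groupby(keys)["error_deg"].agg(["median", "mean", "std"])        # Table 2
    by_user = (df.groupby(keys + ["user"])["error_deg"].mean()
                 .groupby(keys).agg(["median", "mean", "std"]))                   # Table 3
    by_camera = (df.groupby(keys + ["camera"])["error_deg"].mean()
                   .groupby(keys).agg(["median", "mean", "std"]))                 # Table 4
    return overall, by_user, by_camera
```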
5 DISCUSSION
5.1 Classic calibration vs user and setup calibration
Reviewing Table 1, we find that the median values of both models are not directly usable. The notable distinction between the model pre-trained in a real environment (EVE) and the one pre-trained on U2EYES implies that, in an uncalibrated environment, the EVE-pre-trained model aligns more closely with the testing domain, although calibration remains necessary.
If we compare the values shown in Table 1 with those in Table 2, it can be observed that all three methods show improvement over the baseline results. Classic calibration provides the most substantial improvement, as expected, but setup and user calibration also exhibit positive effects. This reaffirms hypothesis H1: although classic calibration yields the best results, applying some form of calibration has a positive impact on the system's performance. However, while these approaches yield some improvement over uncalibrated systems, their ultimate applicability is likely to be limited compared to solutions using classic calibration. These results suggest that the generalization ability of these models is not effective enough to adapt to new setups or users, as posited in hypothesis H3.
5.2 The impact of user and setup configuration in calibration
Analyzing Tables 3 and 4 provides insights into the influence of the user and the camera. While the mean and median values resemble those in Table 2, examining the standard deviation reveals that, in the case of camera variation, the deviation is very low for both classic calibration (Section 3.1) and setup calibration (Section 3.3). These low standard deviation values suggest that, even with a change in the physical arrangement of the setup, these calibration topologies exhibit robustness to such changes, resulting in consistent performance across runs. In other words, classic recalibration and setup recalibration enable a system to readapt to a new configuration.
As for the values in Table 3, they suggest that the user has a more crucial role in gaze estimation than the physical distribution of components (camera-grid relationship). This inference is supported by the significant impact on model performance when there is a change in the test user.
From these results, we could say that:
- User and setup calibration each have a distinct impact on performance.
- For essential usage variations (change of calibration user or change of calibration camera), setup and user calibration exhibit different adaptability behaviors.
This aligns with hypothesis H2. It suggests reconsidering the calibration process as a minimization problem of the two conditions. Classic calibration, as performed in general-camera solutions based on neural networks, addresses both aspects simultaneously and indivisibly. While good results are achieved, perhaps performance could be improved by differentiating the minimization function into the two components, similar to what is done in other neural network problems.
6 CONCLUSIONS
The experiments for this paper were guided by two goals:
- To determine the continued relevance of calibration in general-camera gaze estimation systems.
- To study the influence of the capture hardware and the user on calibration.
A custom multi-camera setup was developed to capture diverse views of the same user during the calibration process. Subsequently, data from multiple users were utilized to train a feature-based model, simulating both classical calibration and variations isolating user and setup conditions.
The first conclusion is that calibration remains a crucial process for achieving accurate gaze estimation results, surpassing the capabilities of calibration-free systems. Developing a model capable of generalizing across all setups and individual characteristics, a prerequisite for calibration-free remote gaze estimation, proves highly challenging.
Among the studied setup and individual conditions, components related to the individual have a slightly more pronounced impact on calibration compared to those related to the setup. Challenges arising from individual characteristics are more difficult to address during general model training. In a multi-user work environment, the emphasis on user adaptation can potentially limit the effectiveness of these solutions, particularly when compared to specialized device-based alternatives.
Regarding physical components, increasing the variety of capture setups, incorporating multiple views, or transitioning to 3D gaze vector estimation can potentially mitigate limitations. However, for user-related components, leveraging their personal characteristics appears necessary, with uncertainties about the effectiveness of multiple views or 3D training in this context.
REFERENCES
- Andrea Abele. 1986. Functions of gaze in social interaction: Communication and monitoring. Journal of Nonverbal Behavior 10 (1986), 83–101. https://link.springer.com/article/10.1007/BF01000006
- Mohd Faizan Ansari, Pawel Kasprowski, and Marcin Obetkal. 2021. Gaze Tracking Using an Unmodified Web Camera and Convolutional Neural Network. Applied Sciences 11, 19 (2021). https://doi.org/10.3390/app11199068
- Adrian Bulat, Jean Kossaifi, Georgios Tzimiropoulos, and Maja Pantic. 2020. Toward fast and accurate human pose estimation via soft-gated skip connections. In 2020 15th IEEE International Conference on Automatic Face & Gesture Recognition.
- Zhaokang Chen and Bertram Shi. 2020. Offset Calibration for Appearance-Based Gaze Estimation via Gaze Decomposition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
- Aaron Fisher, Cynthia Rudin, and Francesca Dominici. 2019. All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously. Journal of Machine Learning Research 20, 177 (2019), 1–81. http://jmlr.org/papers/v20/18-760.html
- Gonzalo Garde, Andoni Larumbe-Bergera, Benoît Bossavit, Sonia Porta, Rafael Cabeza, and Arantxa Villanueva. 2021. Low-Cost Eye Tracking Calibration: A Knowledge-Based Study. Sensors 21, 15 (2021). https://doi.org/10.3390/s21155109
- Tatiany X. Godoi, Deógenes P. da Silva Junior, and Natasha M. Costa Valentim. 2020. A Case Study About Usability, User Experience and Accessibility Problems of Deaf Users with Assistive Technologies. In Universal Access in Human-Computer Interaction. Applications and Practice (Lecture Notes in Computer Science, Vol. 12189). 73–91. https://doi.org/10.1007/978-3-030-49108-6_6
- Amogh Gudi, Xin Li, and Jan van Gemert. 2021. Efficiency in Real-Time Webcam Gaze Tracking. In Computer Vision – ECCV 2020 Workshops.
- Tianchu Guo, Yongchao Liu, Hui Zhang, Xiabing Liu, Youngjun Kwak, Byung In Yoo, Jae-Joon Han, and Changkyu Choi. 2019. A Generalized and Robust Method Towards Practical Gaze Estimation on Smart Phone. In 2019 International Conference on Computer Vision (ICCV) Workshops (ICCV'19).
- Dan Witzner Hansen and Qiang Ji. 2010. In the Eye of the Beholder: A Survey of Models for Eyes and Gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 3 (2010), 478–500. https://doi.org/10.1109/TPAMI.2009.30
- Junfeng He, Khoi Pham, Nachiappan Valliappan, Pingmei Xu, Chase Roberts, Dmitry Lagun, and Vidhya Navalpakkam. 2019. On-Device Few-Shot Personalization for Real-Time Gaze Estimation. In 2019 IEEE International Conference on Computer Vision (ICCV) Workshops (ICCV'19).
- R. S. Hessels. 2020. How does gaze to faces support face-to-face interaction? A review and perspective. Psychonomic Bulletin & Review 27 (2020), 856–881. https://doi.org/10.3758/s13423-020-01715-w
- Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- C. L. Kleinke. 1986. Gaze and eye contact: A research review. Psychological Bulletin 100, 1 (1986), 78–100. https://doi.org/10.1037/0033-2909.100.1.78
- Simon Ladouce, David I. Donaldson, Paul A. Dudchenko, and Magdalena Ietswaart. 2017. Understanding Minds in Real-World Environments: Toward a Mobile Cognition Approach. Frontiers in Human Neuroscience 10 (2017). https://doi.org/10.3389/fnhum.2016.00694
- Andoni Larumbe-Bergera, Gonzalo Garde, Sonia Porta, Rafael Cabeza, and Arantxa Villanueva. 2021. Accurate Pupil Center Detection in Off-the-Shelf Eye Tracking Systems Using Convolutional Neural Networks. Sensors 21 (2021), 6847. Issue 20. https://doi.org/10.3390/s21206847
- Yaqi Li and Caifeng Liu. 2021. User Experience Research and Analysis Based on Usability Testing Methods. In Advances in Graphic Communication, Printing and Packaging Technology and Materials (Lecture Notes in Electrical Engineering, Vol. 754). 263–267. https://doi.org/10.1007/978-981-16-0503-1_39
- Erik Linden, Jonas Sjostrand, and Alexandre Proutiere. 2019. Learning to Personalize in Appearance-Based Gaze Tracking. In The IEEE International Conference on Computer Vision (ICCV) Workshops.
- G. Liu, Y. Yu, K. A. Funes Mora, and J. M. Odobez. 2019. A Differential Approach for Gaze Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019), 1–1. https://doi.org/10.1109/TPAMI.2019.2957373
- Seonwook Park, Emre Aksan, Xucong Zhang, and Otmar Hilliges. 2020. Towards End-to-end Video-based Eye-Tracking. In European Conference on Computer Vision (ECCV).
- Seonwook Park, Shalini De Mello, Pavlo Molchanov, Umar Iqbal, Otmar Hilliges, and Jan Kautz. 2019. Few-Shot Adaptive Gaze Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- Sonia Porta, Benoit Bossavit, Rafael Cabeza, Andoni Larumbe-Bergera, Gonzalo Garde, and Arantxa Villanueva. 2019. U2Eyes: A Binocular Dataset for Eye Tracking and Gaze Estimation. In 2019 IEEE International Conference on Computer Vision (ICCV) Workshops (ICCV'19).
- K. Simonyan, A. Vedaldi, and Andrew Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR abs/1312.6034 (2014).
- Yu Yu and Jean-Marc Odobez. 2020. Unsupervised Representation Learning for Gaze Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
A EXPERIMENTS ILLUSTRATIONS
This appendix aims to provide a visual representation of the experiments discussed in Section 3. Given the complexity of the experiments, illustrations are included to complement the previously provided verbal description.
A.1 Classic calibration
This experiment is oriented towards a single user on a single, fixed device. These conditions allow for classic calibration, where the baseline models can be calibrated with images of the user themselves and under similar conditions; for example, an application designed for a personal desktop computer. A graphical representation of the calibration and test conditions is shown in Figure 5.
A.2 User calibration
In this case, we draw inspiration from applications where, although the user remains unique, the setup may undergo changes in its physical arrangement. This would be the case for applications developed for smartphones or laptops, where the camera position, lighting conditions, etc., are more subject to change. Therefore, calibration uses inputs from the same user, but during actual use the physical arrangement may differ from those captured during the calibration process. A graphical representation of the calibration and test conditions is shown in Figure 6.
A.3 Setup calibration
We consider the case of a fixed setup that needs to estimate gaze for multiple individuals. This could correspond to an estimation system in a commercial booth, where customers passing through the booth have not been part of the calibration. A graphical representation of the calibration and test conditions is shown in Figure 7.