In this section, we provide details on the experimental design and procedures used in this study. First, we introduce the outlier search game implemented in traditional (2D) and virtual reality environments. Next, we describe the study participants, apparatus, game structure, and trials. Finally, we outline the data collection process, pre-processing methods, feature extraction techniques, and classification models.
3.1. Outlier Search Game
The game was implemented in two environments: a traditional setting with a monitor, mouse, and keyboard, and an immersive VR system with a head-mounted display and hand controllers. The outlier search game requires players to identify, within a time limit, anomalous images embedded among normal background images in 3D space. Each trial had a time limit of 10 or 15 min. Each game shows a total of 260 images, of which 10 are outliers. The images were drawn from 5 categories, and players were shown 50 normal and 2 outlier images per category.
One of our motivations for creating the outlier search game (also known as the odd-one-out game) was its popularity and simplicity, and the fact that it is often used in research to test various human cognitive abilities [66,67]. On the other hand, it has been shown that, at a simple level, recognizing outliers requires little to no prior knowledge from the participant, and an earlier implementation of the concept was tested in a domain-expert collaboration task, albeit without the gamified context [68].
Before the actual data collection, we performed two pilot experiments to test the experimental setup and the game settings. The data recorded during this time were excluded from the evaluation. Based on the feedback received, we clarified the instructions, refined the data recording protocol, and limited the time and the number of outliers so that participants would not become too fatigued over the entire experiment (6 play sessions). We therefore set the time limit to 15 min in one version and 10 min in the other, with 10 outliers per session. Based on the user experience feedback, the allotted time and the number of tasks appeared appropriate for participants who were unfamiliar with VR games (all were experienced with a traditional PC setup).
In the traditional setup, we used a custom plugin developed in TypeScript and Python, incorporated into our NipgBoard tool, an interactive online system based on Google's TensorBoard [69]. We used its embedding projector functionality to display the game in the Google Chrome web browser. A visualization of the NipgBoard interface can be seen in Figure 2. Participants were instructed not to alter interface settings or other functions.
The VR version was developed using Unity3D and C#, with the SteamVR plugin integrating the VR headset with the game. We used Unity Professional, version 2020.3.1f1, as the game engine. The default Unity MainCamera was replaced by the Player prefab camera, and C# scripts provided the VR functionality. An example view of the virtual reality environment is shown in Figure 3.
Graphical layouts utilized the MVTec Anomaly Detection image dataset [70]. This dataset contains more than 5000 high-resolution images in 15 classes, each with defective and defect-free examples. Images displaying scratches, cracks, contamination, or structural damage are labeled defective, while normal images exhibit no visible flaws. We selected the following 5 classes: bottle, hazelnut, leather, tile, and transistor, with 50 normal and 2 anomalous samples each.
To enable 3D visualization of the image data, we first performed feature extraction using pre-trained deep neural networks (DNNs). For the traditional (2D) setting, we used the VGG16 ImageNet model [71], while the VR version utilized ResNet50 [72]. We then applied dimensionality reduction techniques to the extracted feature vectors to project the images into an embedded space. Principal component analysis (PCA) was used in both environments to reduce the features to three dimensions. For the VR game, we also applied t-distributed stochastic neighbor embedding (t-SNE) [73] for additional dimensionality reduction.
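As an illustration of this pipeline, the following Python sketch outlines one possible implementation using Keras pre-trained models and scikit-learn. The input size, pooling strategy, t-SNE parameters, and the ordering of PCA and t-SNE for the VR layout are illustrative assumptions, not the exact settings used in the study.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def extract_features(image_paths):
    """Deep features from a pre-trained VGG16 backbone (ResNet50 was used for the VR version)."""
    model = VGG16(weights="imagenet", include_top=False, pooling="avg")
    batch = np.stack([
        preprocess_input(image.img_to_array(image.load_img(p, target_size=(224, 224))))
        for p in image_paths
    ])
    return model.predict(batch)  # shape: (n_images, 512)

def embed_3d(features, vr=False):
    """Project features to 3D: PCA alone (2D setting) or PCA followed by t-SNE (VR setting, assumed ordering)."""
    coords = PCA(n_components=3).fit_transform(features)
    if vr:
        coords = TSNE(n_components=3, init="random", perplexity=30).fit_transform(coords)
    return coords
```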
These methods produced well-separated, distinguishable image clusters based on the learned feature representations. Visual inspection of the 3D visualizations showed a clear separation between the different image categories. The 2D and 3D environments were similar in terms of the display of the full image set: all 5 image clouds were in the field of view at launch, and the participant was free to choose which image set to observe first. The NipgBoard projection space allowed the participant to zoom and rotate the view from an external viewpoint. In VR, the participant started from a central point with all the images surrounding their position, and the navigation allowed them to move around or even pass through the images.
To illustrate the visual possibilities offered by the virtual environment, Figure 4 shows a series of images captured from the VR view. To help participants focus on the task, no additional visual background elements were used.
Differing user input modalities naturally characterize the traditional and VR scenarios. In the traditional setting, users navigate among the displayed images using a mouse or keyboard, with actions performed via mouse clicks. Conversely, in the VR environment, navigation is achieved through free movement in the designated play area, utilizing one or both controllers to execute interactions by pressing the controller buttons.
In general, we aimed to make the game environments as similar as possible in terms of task-solving, game mechanics, and feedback methods. The biggest challenge was to create similar navigation in the 2D and 3D spaces. The traditional input method in 2D is primarily the mouse; since we wanted to reproduce smooth, free mouse-like movement in the virtual environment, flying techniques seemed more appropriate than jumping or teleporting in space.
Participants naturally wish to receive real-time feedback on their performance. To discourage participants from selecting too many images without actual observation or considered choice, incorrect selections were penalized. Their performance was reported back to them via the F1 score, the calculation of which was explained to them beforehand. To inform users about their real-time progress and the success or failure of their actions, various visual feedback mechanisms were implemented. Both game environments provided the following system responses:
Green overlays on images following a correct selection;
Red overlays on images after an incorrect selection;
Blue overlays on images under active observation with the mouse cursor or gaze;
After the successful selection of both outliers within an image group, all remaining images received green overlays;
The remaining time, outlier counter, and F1 score were dynamically updated and displayed in the NipgBoard interface or in the VR field of view.
The NipgBoard interface further displayed feedback with text messages corresponding to correct, incorrect, and repeated selections. Moreover, to monitor user performance in real time, we utilized commonly used evaluation metrics. The F1 score, an effective measure of classifier efficiency, is the harmonic mean of precision and recall, yielding a score between 0 (worst) and 1 (best). Precision measures the ratio of correct predictions to the total number of predictions made, while recall (the true positive rate) represents the ratio of correct predictions to all possible correct predictions. In the context of the outlier search game, a correct prediction corresponds to a participant selecting a true outlier image, whereas an incorrect or false prediction arises when a normal, non-outlier image is selected.
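As a minimal illustration, the sketch below computes the in-game score from a participant's selections; the function and variable names are hypothetical, and the number of outliers defaults to the 10 hidden per trial.

```python
def game_f1(correct_selections, incorrect_selections, total_outliers=10):
    """Precision over all selections, recall over the hidden outliers, and their harmonic mean (F1)."""
    selections = correct_selections + incorrect_selections
    precision = correct_selections / selections if selections else 0.0
    recall = correct_selections / total_outliers
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 7 outliers found with 3 misclicks -> precision = 0.7, recall = 0.7, F1 = 0.7
print(game_f1(7, 3))
```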
To enable accurate calculations, deselecting images once they had been selected was prohibited. The game concludes under two scenarios: either the time limit is reached, or all 10 outliers are found within the allocated time. For each game trial, the recorded performance metrics were the number of incorrect image selections (misclicks), the completion time, and the F1 score.
3.2. Experimental Design
For the sake of completeness, we describe the whole process of the experiments and data collection, but because the current study focuses on gaze-related analysis, the wider evaluation of the data collected from the questionnaires and other logs will not be detailed. The experimental design consisted of three main phases: introduction, training, and data collection. A flowchart with details of each stage is shown in Figure 5.
We started with an instruction phase, where the general concept of the experiment, the outlier search game design, and the different control methods of the game environments were explained to the participants. As part of this stage, the general data protection regulation (GDPR) form, the Big Five Inventory-2 (BFI-2) personality traits test [29], and general information forms about prior gaming experience were filled out. Participants were also asked to complete the pen-and-paper Group Bourdon attention test. The game experience form was a short Likert-scale questionnaire of our own design.
To avoid any doubt about whether an image is an outlier, a short training session was also carried out during the introduction stage. During the training, two pairs per image class were presented, for a total of 10 comparison exercises. Participants had to compare the two given images and decide which one was the outlier. Questions and free discussion of the solutions were allowed.
The final step of this phase was to assign a random order to the six pre-defined game modes and to inform the participant accordingly. The possible variations of game settings in terms of environment, navigation technique, and time frame can be seen in Table 1.
In the training phase, participants were allowed to try out the following two navigation modes in virtual reality: one-handed flying (OHF) and two-handed flying (THF). Drawing on the relevant literature, we chose navigation methods that are generally considered easy or enjoyable for users to learn, but also sufficiently different to be comparable. In the study of Drogemuller et al. [74], which compared commonly used VR navigation techniques, two-handed flying was found to be the fastest and most preferred method among the 25 participants involved. In their findings, one-handed flying was also reported as one of the methods that was easiest to understand and perform. On this basis, we decided to choose a rather simple method (OHF) and a more enjoyable but more complex one (THF). As they require the use of the hands in different ways, they are sufficiently distinct to be evaluated further.
OHF is a unimanual navigation technique in which the user points in the desired direction of travel, and the further the arm is stretched, the faster the movement. THF is a bimanual navigation technique for virtual environments, in which the user's two hands describe a vector that determines the direction and speed of movement.
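The following Python sketch illustrates the intended control logic of the two techniques. The actual implementation runs in Unity/C#; the gain constant, variable names, and the choice of hand ordering for THF are illustrative assumptions.

```python
import numpy as np

GAIN = 2.0  # assumed scaling from arm extension / hand separation to speed

def ohf_velocity(head_pos, hand_pos, hand_forward):
    """One-handed flying: move along the controller's pointing direction;
    the further the arm is stretched from the head, the faster the movement."""
    arm_extension = np.linalg.norm(hand_pos - head_pos)
    direction = hand_forward / np.linalg.norm(hand_forward)
    return GAIN * arm_extension * direction

def thf_velocity(left_hand_pos, right_hand_pos):
    """Two-handed flying: the vector between the two hands gives the direction,
    and its length gives the speed."""
    return GAIN * (right_hand_pos - left_hand_pos)
```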
After the VR training, the traditional experimental setup was also introduced to the participants. Navigation can be performed with mouse movements and clicks or with the keyboard's arrow keys. In the projection space of NipgBoard, one can zoom in and out using the mouse scroll wheel, shift position by holding down the right mouse button and moving the cursor, and rotate the camera view by holding down the left mouse button and moving the cursor. To practice navigation, in the traditional setting we used a data set similar to that of the experiment, and in the VR setting we used the normal and defective samples of the following categories from the MVTec Anomaly Detection database: carpet, grid, wood, and pill. The distribution of outliers and non-outliers was the same as in the actual experiment. In this phase, we did not record gameplay logs or personal data.
The number of outliers per game trial was the same for all participants, but to exclude the possibility that users remembered the location of the outliers, the defective samples were different in each attempt. We prepared six sets of data for the experiment and four backup sets.
The data collection phase was the final stage of the experimental work. In this session, participants were asked to solve the pre-selected, randomly sequenced outlier search tasks. The participants were tested individually. Each experiment took approximately three hours, and each participant had one attempt to complete the experimental stage. However, if dizziness or fatigue was reported while navigating in virtual reality, the experiment was stopped, and the participant could repeat it later with a different set of data. If any technical problems occurred, the game trial was also repeated. Participants were allowed extra rest time upon request. The outlier images were shown to the participants after each experiment as the solutions to the gameplay.
Depending on the game parameters, participants filled out questionnaires after the game trials. After each game, participants responded to a Likert-scale questionnaire of our own design about their experience in that game session. We aimed to capture usability characteristics in terms of navigation, game environment, and time limit parameters.
Furthermore, if the environment was virtual reality, participants were asked to fill in a simulator sickness questionnaire (SSQ) [75] to indicate the subjective occurrence and severity of any symptoms they might have, based on a detailed symptom list. As the work of Bimberg et al. [76] shows, the SSQ can be applied to novel virtual reality research, despite its limitations.
After the six game trials had been completed, participants filled in a questionnaire on the overall user experience and a form measuring the subjective sense of presence. The summary questionnaire was compiled by us and allowed the participants to compare and rate all game settings. The sense of presence form was based on the work of Witmer et al. [77], where the relationship between task performance and the phenomenon of presence in a virtual environment was investigated. Lastly, in a short, informal interview, participants were allowed to share their comments on the experiment.
In terms of data collection, in the traditional setting we captured facial videos of the participants using a simple off-the-shelf HD webcam, gaze data with the Tobii Pro Nano device, and mouse coordinates, screen captures, keyboard events, and game events using our custom Python scripts. In the virtual reality environment, we collected the following data: user actions and game events, gameplay frames, navigation coordinates, and gaze data. For gaze data, we logged gaze origin, gaze direction, and blinks, which are built-in variables of the HTC Vive Pro Eye.
3.3. Apparatus and Participants
Various virtual reality headsets are available on the market, and according to Angelov et al. [78], the HTC Vive Pro is the best performer in terms of technical parameters. The apparatus finally chosen for the experiment was the HTC Vive Pro Eye, which includes eye tracking. This device has a resolution of 1440 × 1600 pixels per eye, a nominal field of view of 110°, and a refresh rate of 90 Hz. To enable eye tracking in the traditional game environment, we used the Tobii Pro Nano eye tracker attached below the monitor. This device has a sampling rate of 60 Hz and uses a video-based eye tracking technique that relies on pupil and corneal reflection with dark and bright pupil illumination modes.
To run both game environments, we used a computer with an eight-core AMD Ryzen 7 2700 processor (3.2 GHz), 32 GB of RAM, an NVIDIA GeForce GTX 1080 video card, and the Windows 10 Home operating system.
The actual experiments were carried out with the involvement of 30 participants, with 11 females and 19 males, aged from 20 to 41 (M = 26.26, SD = 4.37).
Most of the participants (29) were right-handed, and one person was ambidextrous. In terms of eyesight, 15 participants did not need vision correction, 2 wore contact lenses, and 13 wore glasses. Regarding experience with virtual reality, 18 participants reported no previous experience with VR, 5 had tried it occasionally but not in games, and 7 were familiar with VR games. Concerning mouse experience, 18 participants declared themselves to be very skilled and used a gamer mouse, while 12 participants reported being sufficiently skilled in using a mouse.
The volunteers were informed that data about their gameplay would be logged for further analysis and that their answers to the additional tests and questionnaires would also be used. The participants were asked to sign a GDPR consent form before the experiments, and the Ethics Committee of the Faculty of Informatics, Eötvös Loránd University approved the study. We anonymized all personal data before further evaluation.
3.4. Data Pre-Processing
To enable categorization for the classification task, the continuous personality trait scores (ranging from 0 to 100) were divided into low, medium, and high categories using a data-driven binning approach. Specifically, the boundaries of the three bins were determined by first inspecting the distribution of scores for each trait independently (see Figure 6). Thresholds were then defined at the 33rd and 66th percentiles, resulting in bins each containing approximately one-third of the participants. This binning strategy ensured balanced classes while maximizing the separation between categories based on the empirical score distributions.
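A minimal sketch of this percentile-based binning, assuming one score per participant per trait; the function and label names are hypothetical.

```python
import numpy as np

def bin_trait(scores):
    """Split one trait's scores into low/medium/high at the 33rd and 66th percentiles."""
    low_cut, high_cut = np.percentile(scores, [33, 66])
    return np.select(
        [scores <= low_cut, scores <= high_cut],  # conditions evaluated in order
        ["low", "medium"],
        default="high",
    )
```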
Unlike personality traits, attention-related scores were discretized into two classes, high and low, based on a similar data-driven binning approach. This binary representation was chosen for simplicity, since no precedents existed for establishing standardized category boundaries. To determine the binning thresholds, the distributions of the processing speed values and processing accuracy ratios were analyzed (see Figure 7). The histograms revealed natural cut points that split the participants into two evenly populated groups, one with relatively high scores and one with lower scores. For processing speed, the threshold was identified at 170 groups/min: values above this limit were labeled as high processing speed, while scores below it were designated as low processing speed. Likewise, for processing accuracy, the distribution suggested 0.975 as the cut point, with scores above this value categorized as high processing accuracy and scores below it categorized as low processing accuracy.
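A minimal sketch of this binary split using the thresholds reported above; how scores falling exactly on a threshold are handled is an assumption, and the function name is hypothetical.

```python
def bin_attention(speed_groups_per_min, accuracy_ratio):
    """Binary labels for the attention scores, using the thresholds reported in the text."""
    speed_class = "high" if speed_groups_per_min > 170 else "low"    # tie handling assumed
    accuracy_class = "high" if accuracy_ratio >= 0.975 else "low"    # tie handling assumed
    return speed_class, accuracy_class
```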
While our initial discretization of personality traits into low, medium, and high categories was designed for the classification task, we adjusted our strategy for the correlation analysis. Specifically, to facilitate consistent statistical testing alongside the binary attention-related groups, we adopted an additional median split, consolidating the three personality trait categories into two classes (low and high).
This median split was employed solely to allow for an easier correlation analysis and to enable the use of uniform statistical methods. It is important to note that the correlation analysis is exploratory and that the presence of false positives (type I errors) cannot be completely excluded, but the reported statistical details and effect sizes give a good indication of the strength of the findings.
Finally, quality control checks revealed corrupted gaze log files for two participants. To preserve analysis accuracy and data integrity, these individuals were excluded from all subsequent analyses.
3.5. Feature Extraction from Gaze Data
We pre-processed the raw gaze data via feature engineering. We first performed linear interpolation on the missing gaze vectors and then proceeded to extract eye movement events.
In the VR environment, we detected fixations and saccades using a modified Velocity-Threshold Identification (I-VT) algorithm suitable for the VR setting [84]. Since there is no prior knowledge of gaze velocity and duration thresholds for fixation and saccade detection in VR setups, we first experimented with the threshold values used in [62], but these yielded poor performance in the classification tasks. Therefore, we utilized I-VT in conjunction with the Median Absolute Deviation (MAD), a robust estimator of dispersion that is resilient to the influence of outliers and can automatically find a coherent separation threshold [85]. In the 2D environment, we utilized the Robust Eye Movement Detection for Natural Viewing (REMoDNaV) algorithm [86]. REMoDNaV is a robust eye movement detection method that accounts for variations in the distance between the participant's eyes and the eye tracker over time, making it suitable for scenarios with varying viewing distances. Furthermore, it performs robustly on data with temporally varying noise levels.
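To illustrate the general idea, the sketch below labels gaze samples with a velocity threshold derived from the MAD of the sample-to-sample angular velocities. It is a simplification of the modified I-VT used in the study; the scaling factor k and the per-sample labeling scheme are assumptions.

```python
import numpy as np

def detect_fixation_samples(gaze_dirs, timestamps, k=3.0):
    """Label inter-sample intervals as fixation (True) or saccade (False).

    gaze_dirs: (n, 3) unit gaze direction vectors; timestamps in seconds.
    """
    # Angular velocity between consecutive gaze directions (deg/s).
    dots = np.clip(np.sum(gaze_dirs[1:] * gaze_dirs[:-1], axis=1), -1.0, 1.0)
    velocities = np.degrees(np.arccos(dots)) / np.diff(timestamps)

    # MAD-based adaptive threshold: robust to the saccadic outliers in the velocity distribution.
    median_v = np.median(velocities)
    mad = np.median(np.abs(velocities - median_v))
    threshold = median_v + k * mad

    return velocities < threshold
```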
During a fixation, the visual gaze is sustained on a single location for a certain period. Fixations are valuable indicators of attention and cognitive processing activity [87,88]. As part of our feature set, we used the fixation rate, which is the number of fixations per second, the fixation duration, which is the total duration of fixations, and three statistical descriptors, namely the mean, standard deviation, and maximum of the individual fixation durations.
Saccades also provide valuable information and have been found to correlate strongly with visual search behavior [89]. In the same manner as for fixations, we used the saccade rate, the saccade duration, and the corresponding statistical descriptors of the saccade durations.
To extract gaze features from the raw logs, we used the CatEyes Python toolbox [90], which includes REMoDNaV and the modified I-VT. Initially, for each individual setting in 2D (2D-10, 2D-15) and VR (OHF-10, OHF-15, THF-10, THF-15), we extracted our set of fixation and saccade features, resulting in 10 gaze features per setting. Then, we concatenated the features from all settings of each environment to form the final feature sets, referred to as "2D-All" and "VR-All".
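To make the per-setting feature set concrete, the sketch below computes the 10 fixation/saccade features from event duration lists; the input format and function name are hypothetical.

```python
import numpy as np

def gaze_features(fixation_durations_s, saccade_durations_s, session_length_s):
    """The 10 features per game setting: rate, total duration, and mean/std/max of the
    durations, for fixations and saccades (assumes at least one event of each type)."""
    fix = np.asarray(fixation_durations_s)
    sac = np.asarray(saccade_durations_s)
    return {
        "fixation_rate": len(fix) / session_length_s,
        "fixation_total_duration": fix.sum(),
        "fixation_duration_mean": fix.mean(),
        "fixation_duration_std": fix.std(),
        "fixation_duration_max": fix.max(),
        "saccade_rate": len(sac) / session_length_s,
        "saccade_total_duration": sac.sum(),
        "saccade_duration_mean": sac.mean(),
        "saccade_duration_std": sac.std(),
        "saccade_duration_max": sac.max(),
    }
```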
Finally, to ensure the most effective predictive performance for both the Big Five traits and attention-related groups, we employed a feature selection approach to retain only the most informative gaze features. Firstly, we calculated the importance of each gaze feature using three different metrics, namely chi-squared, mutual information, and ANOVA F-value. Among these metrics, mutual information yielded the best results overall and was selected as our feature selection metric.
Then, we iteratively trained our models, increasing the number of selected features from 1 to T, where T is the total number of features for each 2D/VR setting: 10 for the individual settings in 2D/VR, 20 for 2D-All, and 40 for VR-All. We repeated this process until we obtained the best average F1 performance. This allowed us to identify the k gaze features with the highest mutual information scores and to retain only those features in our final feature set. By doing so, we aimed to reduce the dimensionality of the feature space and improve the performance of our predictive models.
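A sketch of this selection loop using scikit-learn; the estimator, the cross-validation scheme, and the macro-averaged F1 scoring used here are placeholders rather than the study's exact evaluation protocol.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score

def select_best_k(X, y, estimator, max_k):
    """Grow the feature set from 1 to max_k features ranked by mutual information
    and keep the size that yields the best average F1 score."""
    best_k, best_f1 = 1, -1.0
    for k in range(1, max_k + 1):
        X_k = SelectKBest(mutual_info_classif, k=k).fit_transform(X, y)
        f1 = cross_val_score(estimator, X_k, y, scoring="f1_macro", cv=5).mean()
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1
```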