One of the most prominent problems in using the eyes to control user interfaces stems from the fact that the eyes have evolved to observe, not to change, the environment. This gives rise to the frequently observed
Midas Touch problem, which describes unintentional gaze behaviors that affect the interaction result (e.g., selecting the wrong menu button by merely glancing at it) [
99]. Much of the previous work in eye-based HCI research has concentrated on developing interaction strategies that prevent the Midas Touch problem. Dwell time is a common approach to solving this problem, and researchers have tested different dwell times depending on the purpose of the interaction, such as target selection and manipulation [
218,
226]; replicating a single-button mouse for pointing, selecting, and dragging [
43]; and scrolling while reading [
119]. As the dwell time approach can cause fatigue and slow down the interaction in certain applications [
92], other approaches were developed to ease the interaction, relying either on gaze behavior, such as gaze gestures [
93] and fixation patterns [
133], or on other modalities in multi-modal platforms [
14,
26,
213,
218,
228,
258].
In this section, we present research that investigates eye-based interaction techniques in the XR domain. We describe the modalities utilized to facilitate user interactions, identify solutions to the Midas Touch problem, and describe the different application areas benefiting from gaze-based interactions.
3.1.1 Eye-Only Interaction.
Many research efforts utilized the eyes as the sole input for their XR interaction space. One of the common research questions in this domain was understanding how eye-only interaction compares to other input modalities, such as pointing or head-based interaction. Other researchers focused on developing and investigating solutions to issues specific to eye-only interaction, such as utilizing dwell time to address the Midas Touch problem. In the following, we provide a detailed description of these works and their findings.
Comparative Studies. In the real world, humans use their bodies to attend to their environment or to communicate their attention to others, for example by pointing or by directing their head or eyes toward an object of interest. These nonverbal cues are therefore natural contenders for target selection and manipulation tasks in XR. As XR technology advances and becomes more ubiquitous, the need to understand the performance, social, and cultural requirements and implications of different interaction modalities increases. For instance, for a specific task, eye-based interaction might turn out to be faster and more discreet than pointing. In past works, understanding the performance capabilities of eye-based interactions was an important part of developing interaction techniques, usually through comparative studies with other input modalities.
In one of the earlier comparative studies in VR, Tanriverdi et al. compared an eye-based interaction technique with hand pointing for a search and selection task [
243]. They found that participants’ interactions were faster using the eye tracking-based method. Gaze input was very commonly compared with head rotation in augmented and virtual reality HMDs [
21,
82,
122,
167,
201]. Kytö et al. found that while their head-only interaction technique was more accurate than eye tracking, the eye-only technique was faster [
122]. However, Qian et al. found faster selection times and higher accuracy values for head-based interaction compared to eye-only interaction [
201]. They also found that the head-only technique was overall more fatiguing than the eye-only technique, except for neck fatigue; this was also observed by Blattgerste et al., who reported that users found eye-only interaction less exhausting than head-based interaction [
21]. They further found that the eye-only method mostly produced fewer errors than the head-based method. Minakata et al. [
167] found that eye gaze was slower for pointing than head and foot-based controls. Choi et al. compared eye-gaze selection with head-rotation-based selection in a VR environment, and found that users preferred eye-gaze selection in terms of convenience and satisfaction, and they preferred head-rotation for ergonomics [
38]. Jalaliniya et al. compared target pointing on a head-mounted display using gaze, head, and mouse, finding that eye-based pointing was significantly faster, while users felt that head pointing was more accurate and convenient [
100]. Esteves et al. [
58] compared head rotation and eye pursuit for tracking virtual targets. The results suggested that head-based input can track moving targets more accurately than the eyes. Zhang et al. [
272] compared eye-gaze-based and controller-based controls in robot teleoperation. In their work, the use of eye gaze resulted in slower operations with more errors and had a negative impact on the user’s situational awareness and recall of the environment. Luro and Sundstedt [
145] compared eye gaze and controller-based aiming in VR and found that both performed similarly.
Dwell-Time. One of the most common eye-only interaction methods is the
dwell-time approach. Here, the eyes have to rest on a target for a predefined amount of time to trigger an input event. To provide ALS patients with more interactive capabilities, Lin et al. developed an HMD eye tracker, together with calibration and data processing methods, to accurately detect a user’s gaze, activate a speech system linked to different menu items, and select those items [
138]. Graupner et al. evaluated the usability of a see-through HMD with gaze-based interaction capabilities, measuring reaction time and hit rate in point selection tasks and investigating the influence of factors such as noise, sampling rate, and target size [
74]. Nilsson et al. developed a gaze-attentive video see-through AR prototype for instructional purposes, illustrated in Figure
5, where users following the sequential steps of a task could activate each step by fixating on interactive virtual buttons [
178,
179]. Rajanna and Hansen [
205] compared a dwell-time approach with clicking on a controller for typing on a virtual keyboard. They found that clicking on a controller was faster and produced fewer errors than the dwell-time approach. Vörös et al. [
256] developed an interface that allows people with severe speech and physical impairments to select words using gaze and thereby communicate with others. Giannopoulos [
68] used dwell-time-based selection in a virtual retail environment. Cottin et al. integrated an
optical see-through (OST) HMD with an eye tracker to allow users to select virtual objects on the HMD screen with the dwell-time approach in a smart home application [
45]. Liu et al. [
142] designed a gaze-only interface for positioning an object in 3D by adjusting its position on pre-defined planes.
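To make the mechanism concrete, the following is a minimal sketch of a dwell-time trigger as described at the beginning of this paragraph. It is not drawn from any of the surveyed systems; the 500 ms threshold and the interface are illustrative assumptions.

```python
import time

class DwellSelector:
    """Fires a selection once gaze has rested on the same target long enough."""

    def __init__(self, threshold_s=0.5):  # illustrative value; surveyed systems tune this per task
        self.threshold_s = threshold_s
        self.current_target = None
        self.dwell_start = 0.0

    def update(self, target_id, now=None):
        """Feed the target currently under the gaze point (None if no target).

        Returns the target_id once per completed dwell, otherwise None.
        """
        now = time.monotonic() if now is None else now
        if target_id != self.current_target:
            # Gaze moved to a new target (or off all targets): restart the timer.
            self.current_target = target_id
            self.dwell_start = now
            return None
        if target_id is not None and now - self.dwell_start >= self.threshold_s:
            self.dwell_start = float("inf")  # fire only once per dwell
            return target_id
        return None
```

Real systems typically add visual feedback during the dwell (e.g., a shrinking circle) and tune the threshold to trade interaction speed against Midas Touch errors, which is exactly the trade-off discussed above.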
Dwell-Time Alternatives. The dwell-time approach remains rather prone to the Midas Touch problem, and several alternative approaches have been proposed to resolve it. Lee et al. developed an approach that utilizes half blinks together with gaze information to facilitate tasks such as target selection, which was tested through interaction with augmented annotations in AR [
129]. Khamis et al. presented an approach that used smooth pursuit eye movements for selection of 3D targets in a virtual environment [
111]. They found that the technique is robust against target size and that detection improves with an increasing movement radius. Gao et al. developed an eye gesture interface, where combinations of eye movements are measured by an amplified AC-coupled electrooculograph [
63]. The proposed interface achieved a success rate of 97% in recognizing eye movements. Xiong et al. combined eye fixations and blinks in a typing user interface [
266]. Toyama et al. combined sequences of eye fixations (rather than per-frame fixations) with object recognition algorithms to build an AR museum guidance application [
247]. Hirata et al. [
85] designed an interface based on conscious change of eye vergence to select objects in 2D and 3D.
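Among these alternatives, the smooth pursuit selection of Khamis et al. [111] lends itself to a compact illustration. Pursuit-based selection is commonly implemented by correlating the gaze trajectory with each moving target's trajectory over a sliding window; the function names, the 0.8 threshold, and the window handling below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pursuit_score(gaze_xy, target_xy):
    """Pearson correlation between gaze and target trajectories.

    Both arguments are arrays of shape (n_samples, 2) covering the same
    sliding time window. Returning the weaker of the two axis correlations
    means both axes must match for a high score.
    """
    corr_x = np.corrcoef(gaze_xy[:, 0], target_xy[:, 0])[0, 1]
    corr_y = np.corrcoef(gaze_xy[:, 1], target_xy[:, 1])[0, 1]
    return min(corr_x, corr_y)

def select_pursued_target(gaze_xy, targets, threshold=0.8):
    """Return the id of the target being smoothly pursued, or None.

    targets: mapping from target id to its (n_samples, 2) trajectory.
    The threshold is tuned together with the window length in practice.
    """
    best_id, best_score = None, threshold
    for target_id, target_xy in targets.items():
        score = pursuit_score(gaze_xy, target_xy)
        if score > best_score:
            best_id, best_score = target_id, score
    return best_id
```

Because the decision depends on trajectory correlation rather than on the gaze landing inside a target, such a detector is largely independent of target size, consistent with the robustness finding reported above.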
Continuous Input. Gaze has been used as a continuous input signal for navigation and control tasks in virtual environments and teleoperation [
11,
118,
154,
233,
271]. Gaze has also been explored in narrative and tourism applications that provide users with information about different objects of interest by either detecting the gaze in a highlighted area of interest or during free exploration [
121,
267].
Overall, past works indicate a variety of applications where eye-only interaction was used. Although comparisons between eye-only interaction and other modalities have not always produced consistent findings, differences in the HMDs and eye trackers utilized and in the interaction tasks can explain some of these inconsistencies. We also observed that the ease of using the dwell-time approach has allowed its adoption across a wide range of research topics, from usability assessment [
74] to increasing users’ accessibility [
138,
256]. However, due to certain limitations that this approach can introduce, such as interaction time delays and user fatigue [
92,
205], we observed an increasing attention to alternative approaches [
63,
85,
111,
129,
247,
266]. The variety of these alternative approaches suggests the potential of eye-based interaction as a flexible interaction mechanism that can accommodate different user capabilities and interaction contexts. However, open questions exist regarding the usability and performance of these approaches compared to each other and to other interaction modalities. Additionally, advances in eye tracking and HMD technologies and in artificial intelligence algorithms hold promise for more streamlined interactions in the future.
3.1.2 Multi-Modal Interaction.
Understanding the capabilities of eye-only interaction is highly valuable, especially for circumstances concerning specific disabilities, where eyes are the only interaction input. However, combining eye-based interactions with other modalities (e.g., head-based and gesture-based interactions) can create a richer and more expressive experience for the user and also better facilitate certain complex tasks. In the following, we describe previous works that focused on combining eye input with different modalities.
Eye and Traditional Input. Devices that interface through mechanical inputs (e.g., button presses), such as cell phones and smart watches, have become ubiquitous, making them ideal modality pairs to bring eye-based interactions to a wider audience. Sidorakis et al. presented a VR user interface combining gaze with an additional mechanical input to signify a selection [
224]. The multi-modal interaction scheme was evaluated as more accurate than traditional mouse/keyboard interaction in an immersive virtual environment. A similar interaction technique was employed in a mobile AR game [
126] and wheelchair navigation [
89]. Ahn and Lee [
3] explored the benefits of combining eye gaze with a control pad attached to a head-mounted display for typing and found this combination to outperform both exclusive eye gaze and exclusive control pad input. Bâce et al. developed ubiGaze, where the interaction is based on gaze tracking and a smartwatch [
12]. The gaze provides selection of real-world objects, and the smartwatch can receive various commands to be executed with regard to the objects. Mardanbegi et al. [
151] combined gaze with a control tool attached to a handheld controller to simultaneously select an object to interact with and a function to apply to it.
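The pattern these systems share is that gaze supplies continuous pointing while the mechanical input supplies a discrete confirmation, which sidesteps the Midas Touch problem because looking alone never triggers an action. A minimal sketch of this pattern, assuming hypothetical eye_tracker, button, and scene interfaces:

```python
def gaze_plus_click(eye_tracker, button, scene):
    """Gaze points, a mechanical button confirms.

    eye_tracker, button, and scene are placeholder interfaces;
    hit_test() maps a gaze point to the object (if any) under it.
    """
    while True:
        gaze_point = eye_tracker.latest_gaze()   # continuous pointing via gaze
        hovered = scene.hit_test(gaze_point)     # object currently under gaze
        scene.highlight(hovered)                 # feedback while hovering
        # Selection requires an explicit mechanical event, not gaze alone.
        if button.was_pressed() and hovered is not None:
            scene.select(hovered)
```

The same skeleton covers controllers, touchpads, and smartwatches; only the source of the confirmation event changes.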
Eye and Speech. Speech-based interaction is one of the primary modalities used to communicate with intelligent characters in futuristic movies. However, due to technological limitations, comparatively little has been done to pair speech with eye-based interactions. Beach et al. developed one of the earlier multi-modal prototypes to provide hands-free interaction by utilizing speech, and discussed the possible use of other modalities, such as blinking or fixating on a desired target, in case speech input is inaccessible [
17].
Eye and Gestures. In many eye-based interactions, using eye input for target selection leaves other modalities, such as the hands, free to be utilized for other interactions such as object manipulation. Heo et al. developed a multi-modal interaction interface for gaming purposes that includes eye, hand gesture, and bio-signal inputs [
81]. In their setup, pointing toward targets of interest was controlled using gaze, gestures were used for selection and manipulation, and bio-signals controlled the difficulty of the game. Pai et al. [
190] combined eye gaze and contractions of arm muscles measured by EMG for subtle selection and interaction. Novak et al. integrated dwell time and intentional movement for VR-based patient rehabilitation [
181]. The system finds the patient’s focus in a VR environment via fixation, and if the rehabilitation robot detects the patient’s intention to move, the robot provides appropriate support.
Other multi-modal interaction approaches include the combination of eye tracking with freehand 3D gestures [
48,
122,
193,
219]. Deng et al. defined the spatial misperception problem that occurs during continuous indirect manipulation with a direct manipulation device [
48], which is observed when combining gaze and gesture input and leads to manipulation errors and user frustration. The authors introduced three methods, all of which improved the manipulation performance of virtual objects. Pfeuffer et al. introduced the
Gaze + Pinch interaction technique for virtual reality. Here, a user’s gaze point is used to indicate the desired object of interaction, whereas pinch gestures are used for its manipulation, thus enabling interaction and manipulation with both near and far objects. This technique addresses the limitation of the virtual hand metaphor, which only allows for near interaction, and, compared to controller-based methods, the user is not required to constantly hold a device.
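A sketch of a single frame of a Gaze + Pinch style update loop may help clarify this division of labor. The hand tracking and scene interfaces below are hypothetical placeholders under our own naming, not the authors' implementation:

```python
def gaze_pinch_update(eye_tracker, hand_tracker, scene, state):
    """One frame of a Gaze + Pinch style loop: gaze indicates the object,
    the pinch gesture grabs and manipulates it, near or far alike.
    All interfaces are illustrative placeholders; positions are 3D vectors.
    """
    if hand_tracker.is_pinching():
        if state.grabbed is None:
            # Pinch onset: acquire whatever object the user is looking at.
            state.grabbed = scene.hit_test(eye_tracker.latest_gaze())
            state.last_hand = hand_tracker.hand_position()
        else:
            # While the pinch is held, relative hand motion drives the
            # object, so manipulation no longer depends on gaze.
            hand = hand_tracker.hand_position()
            state.grabbed.translate(hand - state.last_hand)
            state.last_hand = hand
    else:
        state.grabbed = None  # pinch released: drop the object
```

Decoupling indication (gaze) from manipulation (hand) is what lets the same gesture act on distant objects, avoiding the reach limit of the virtual hand metaphor noted above.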
Eye and Head Rotation. A common approach is to combine eye tracking with head rotation. Techniques have been proposed to allow for hands-free navigation of virtual environments [
187,
192,
202,
222,
260]. Findings indicate that navigation techniques benefit from combining eye tracking with head rotation, since the combination is able to correct for common eye tracking problems, such as calibration drift [
202]. Sidenmark and Gellersen [
222] explored different combinations of eye and head gaze that leverage the synergetic movement of the eyes and head for selection and exploration of an environment. Combination techniques were also found to perform better than eye-only techniques [
122,
202]. Piumsomboon et al. proposed three eye-based interaction techniques for navigation and selection in virtual reality [
197]. Here, they leveraged specific properties of various eye movements. The
Vestibulo-Ocular Reflex (VOR) was, for example, used for a navigation task, whereas an eye-only technique was proposed for selecting targets. These results suggest that different eye-based interaction techniques should not be used competitively, but that specific techniques should serve specific tasks in augmented and virtual environments. Mardanbegi et al. also proposed using the VOR to improve selection. In their work, however, the VOR was explored in the context of 3D gaze estimation, in particular in comparison to vergence; their VOR-based depth estimation showed similar performance in several scenarios despite requiring only one tracked eye [
150].
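A recurring pattern in these eye-head combinations is to use fast but noisy gaze for coarse pointing and slower head motion for refinement, e.g., to compensate for the calibration drift mentioned above. The following sketch illustrates that general idea; the gain value, the 2D projection of head orientation, and the saccade reset rule are illustrative assumptions rather than any single surveyed technique:

```python
import numpy as np

HEAD_GAIN = 0.2        # illustrative: head motion nudges the cursor gently
SACCADE_RESET_PX = 50  # illustrative: large gaze jumps discard the refinement

def refined_cursor(gaze_point, prev_gaze_point, head_dir, prev_head_dir, offset):
    """Combine coarse gaze pointing with fine head-based refinement.

    gaze_point / prev_gaze_point: 2D gaze samples (fast but noisy or offset).
    head_dir / prev_head_dir: 2D projections of the head orientation.
    offset: accumulated head refinement carried across frames.
    Returns (refined cursor position, updated offset) as shape-(2,) arrays.
    """
    if np.linalg.norm(gaze_point - prev_gaze_point) > SACCADE_RESET_PX:
        offset = np.zeros(2)  # a saccade moved the coarse cursor: start fresh
    # Frame-to-frame head rotation becomes a small cursor correction, letting
    # the user deliberately compensate for eye tracker inaccuracy.
    offset = offset + HEAD_GAIN * (head_dir - prev_head_dir)
    return gaze_point + offset, offset
```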
Eye and BCI. Utilizing
brain–computer interfaces (BCIs) in a hybrid form (e.g., Eye + BCI) can increase the performance of the whole system [
194]. Ma et al. combined a brain–computer interface with eye tracking for typing in virtual reality [
147,
269]. A similar setup has been applied in 3D object manipulation [
40] and a horizontal scrolling and selection interface [
156]. Putze et al. [
200] combined eye tracking and steady-state visually evoked potentials (SSVEP) to improve the robustness of target selection.
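Such hybrids can be framed as fusing per-target evidence from both channels. The sketch below shows one simple weighted fusion; the score definitions, weight, and threshold are illustrative assumptions and are not taken from [194] or [200]:

```python
def fuse_gaze_and_bci(dwell_scores, bci_scores, gaze_weight=0.5, threshold=0.7):
    """Fuse normalized per-target evidence from gaze (e.g., fraction of the
    last window spent dwelling on each target) with per-target probabilities
    from a BCI classifier (e.g., SSVEP). Returns the winning target id, or
    None if the combined evidence stays below the decision threshold.
    """
    fused = {
        target: gaze_weight * dwell_scores.get(target, 0.0)
                + (1.0 - gaze_weight) * bci_scores.get(target, 0.0)
        for target in set(dwell_scores) | set(bci_scores)
    }
    best = max(fused, key=fused.get, default=None)
    if best is not None and fused[best] >= threshold:
        return best
    return None
```

Requiring agreement from two independent channels is what makes such hybrid selection more robust than either channel alone.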
Overall, we observed a wide range of modalities paired with eye-based input across various applications, such as accessibility, health care, and entertainment. Some modalities, including traditional input [
3,
12,
89,
126,
151,
224], head rotation [
150,
187,
192,
197,
202,
222,
260], and gestures [
48,
81,
122,
181,
190,
193,
219], were more commonly investigated. This can be explained by the fact that some of these modalities are more well-established (i.e., traditional input), and in some cases, others are already paired in one device or have dedicated resources for pairing, for instance, HMDs with eye trackers like FOVE and HP Omnicept, or eye tracking and hand tracking add-ons like Pupil Labs and Leap Motion. Separately, advances in natural language processing and the ubiquity of the speech modality, evident from the popularity of digital home assistants such as Amazon Alexa and Google Home, hold promise for more research on combining speech and eye input, as we identified only one such example in our review [
17]. When considering eye gaze for interaction in VR, we should not forget the impact of head and torso movement, particularly as VR is increasingly moving toward fully tracked free movement. Sidenmark and Gellersen [
221] recently explored the coordination between the eyes, head, and torso when looking at targets in VR. Their findings gave insights into the coordination of these body parts and highlighted that, when designing gaze-based interfaces, these modalities should be considered as a whole and not separately.