One of the most prominent problems in using the eyes to control user interfaces stems from the fact that the eyes have evolved to observe, not to change, the environment. This gives rise to the frequently observed
Midas Touch problem, which describes unintentional gaze behaviors that affect the interaction result (e.g., selecting the wrong menu button by merely glancing at it) [
99]. Much of the previous work in eye-based HCI research has concentrated on developing interaction strategies that prevent the Midas Touch problem. Dwell time is a common approach to solving this problem, and researchers have tested different dwell times depending on the purpose of the interaction, such as target selection and manipulation [
218,
226]; replicating a single-button mouse for pointing, selecting, and dragging [
43]; and scrolling while reading [
119]. As the dwell time approach can cause fatigue and slow down the interaction in certain applications [
92], other approaches were developed to ease the interaction, relying either on gaze behavior, such as gaze gestures [
93] and fixation patterns [
133], or on other modalities in multi-modal platforms [
14,
26,
213,
218,
228,
258].
In this section, we present research that investigates eye-based interaction techniques in the XR domain. We describe the modalities utilized to facilitate user interactions, identify solutions to the Midas Touch problem, and describe the different application areas benefiting from gaze-based interactions.
3.1.1 Eye-Only Interaction.
Many research efforts utilized the eyes as the sole input for their XR interaction space. One of the common research questions in this domain was understanding how eye-only interaction compares to other input modalities, such as pointing or head-based interaction. Other researchers focused on developing and investigating solutions to issues specific to eye-only interaction, such as utilizing dwell time to address the Midas Touch problem. In the following, we provide a detailed description of these works and their findings.
Comparative Studies. In the real world, humans use their bodies to attend to their environment or to communicate their attention to others, for example by pointing or by directing their head or eyes toward an object of interest. These nonverbal cues are therefore natural contenders for target selection and manipulation tasks in XR. As XR technology advances and becomes more ubiquitous, the need to understand the performance, social, and cultural requirements and implications of different interaction modalities increases. For instance, for a specific task, eye-based interaction might turn out to be faster and more discreet than pointing. In past works, understanding the performance capabilities of eye-based interactions was an important part of developing interaction techniques, usually through comparative studies with other input modalities.
In one of the earlier comparative studies in VR, Tanriverdi et al. compared an eye-based interaction technique with hand pointing for a search and selection task [
243]. They found that participants’ interactions were faster using the eye tracking-based method. Gaze input was very commonly compared with head rotation in augmented and virtual reality HMDs [
21,
82,
122,
167,
201]. Kytö et al. found that while their head-only interaction technique was more accurate than eye tracking, the eye-only technique was faster [
122]. However, Qian et al. found faster selection times and higher accuracy values for head-based interaction compared to eye-only interaction [
201]. They also found that the head-only technique was overall more fatiguing than the eye-only technique, except for neck fatigue; this was also observed by Blattgerste et al., who reported that users found eye-only interaction less exhausting than head-based interaction [
21]. They further found that the eye-only method mostly produced fewer errors than the head-based method. Minakata et al. [
167] found that eye gaze was slower for pointing than head and foot-based controls. Choi et al. compared eye-gaze selection with head-rotation-based selection in a VR environment, and found that users preferred eye-gaze selection in terms of convenience and satisfaction, and they preferred head-rotation for ergonomics [
38]. Jalaliniya et al. compared target pointing on a head-mounted display using gaze, head, and mouse, finding that eye-based pointing was significantly faster, while users felt that head pointing was more accurate and convenient [
100]. Esteves et al. [
58] compared head rotation and eye pursuit for tracking virtual targets. The results suggested that head-based input can track moving targets more accurately than the eyes. Zhang et al. [
272] compared eye-gaze-based and controller-based controls in robot teleoperation. In their work, the use of eye gaze resulted in slower operations with more errors and had a negative impact on the user’s situational awareness and recall of the environment. Luro and Sundstedt [
145] compared eye gaze and controller-based aiming in VR and found that both performed similarly.
Dwell-Time. One of the most common eye-only interaction methods is the
dwell-time approach. Here, the eyes have to rest on a target for a predefined amount of time to trigger an input event. To provide ALS patients with more interactive capabilities, Lin et al. developed an HMD eye tracker, together with calibration and data processing methods, to accurately detect a user’s gaze, activate a speech system linked to different menu items, and select those items [
138]. Graupner et al. evaluated the usability of a see-through HMD with gaze-based interaction capabilities, measuring reaction time and hit rate in point selection tasks and investigating the influence of factors such as noise, sampling rate, and target size [
74]. Nilsson et al. developed a gaze-attentive video see-through AR prototype for instructional purposes, illustrated in Figure
5, where users following the sequential steps of a task could activate each step by fixating on interactive virtual buttons [
178,
179]. Rajanna and Hansen [
205] compared a dwell-time approach with clicking on a controller for typing on a virtual keyboard. They found that clicking on a controller was faster and produced fewer errors than the dwell-time approach. Vörös et al. [
256] developed an interface that allows people with severe speech and physical impairments to select words using gaze and thereby communicate with others. Giannopoulos [
68] used dwell-time-based selection in a virtual retail environment. Cottin et al. integrated an
optical see-through (OST) HMD with an eye tracker to allow users to select virtual objects on the HMD screen with the dwell-time approach in a smart home application [
45]. Liu et al. [
142] designed a gaze-only interface for positioning an object in 3D by adjusting its position on pre-defined planes.
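To make the mechanism concrete, the following is a minimal sketch of a dwell-time trigger as described at the beginning of this paragraph. It is not drawn from any of the surveyed systems; the 500 ms threshold and the interface are illustrative assumptions.

```python
import time

class DwellSelector:
    """Fires a selection once gaze has rested on the same target long enough."""

    def __init__(self, threshold_s=0.5):  # illustrative value; surveyed systems tune this per task
        self.threshold_s = threshold_s
        self.current_target = None
        self.dwell_start = 0.0

    def update(self, target_id, now=None):
        """Feed the target currently under the gaze point (None if no target).

        Returns the target_id once per completed dwell, otherwise None.
        """
        now = time.monotonic() if now is None else now
        if target_id != self.current_target:
            # Gaze moved to a new target (or off all targets): restart the timer.
            self.current_target = target_id
            self.dwell_start = now
            return None
        if target_id is not None and now - self.dwell_start >= self.threshold_s:
            self.dwell_start = float("inf")  # fire only once per dwell
            return target_id
        return None
```

Real systems typically add visual feedback during the dwell (e.g., a shrinking circle) and tune the threshold to trade interaction speed against Midas Touch errors, which is exactly the trade-off discussed above.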
Dwell-Time Alternatives. The dwell-time approach remains rather prone to the Midas Touch problem, and several alternative approaches have been proposed to resolve it. Lee et al. developed an approach that utilizes half blinks together with gaze information to facilitate tasks such as target selection, which was tested through interaction with augmented annotations in AR [
129]. Khamis et al. presented an approach that used smooth pursuit eye movements for selection of 3D targets in a virtual environment [
111]. They found that the technique is robust against target size and that detection improves with an increasing movement radius. Gao et al. developed an eye gesture interface, where combinations of eye movements are measured by an amplified AC-coupled electrooculograph [
63]. The proposed interface achieved a success rate of 97% in recognizing eye movements. Xiong et al. combined eye fixations and blinks in a typing user interface [
266]. Toyama et al. combined sequences of eye fixations (rather than per-frame fixations) with object recognition algorithms to build an AR museum guidance application [
247]. Hirata et al. [
85] designed an interface based on conscious change of eye vergence to select objects in 2D and 3D.
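Among these alternatives, the smooth pursuit selection of Khamis et al. [111] lends itself to a compact illustration. Pursuit-based selection is commonly implemented by correlating the gaze trajectory with each moving target's trajectory over a sliding window; the function names, the 0.8 threshold, and the window handling below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pursuit_score(gaze_xy, target_xy):
    """Pearson correlation between gaze and target trajectories.

    Both arguments are arrays of shape (n_samples, 2) covering the same
    sliding time window. Returning the weaker of the two axis correlations
    means both axes must match for a high score.
    """
    corr_x = np.corrcoef(gaze_xy[:, 0], target_xy[:, 0])[0, 1]
    corr_y = np.corrcoef(gaze_xy[:, 1], target_xy[:, 1])[0, 1]
    return min(corr_x, corr_y)

def select_pursued_target(gaze_xy, targets, threshold=0.8):
    """Return the id of the target being smoothly pursued, or None.

    targets: mapping from target id to its (n_samples, 2) trajectory.
    The threshold is tuned together with the window length in practice.
    """
    best_id, best_score = None, threshold
    for target_id, target_xy in targets.items():
        score = pursuit_score(gaze_xy, target_xy)
        if score > best_score:
            best_id, best_score = target_id, score
    return best_id
```

Because the decision depends on trajectory correlation rather than on the gaze landing inside a target, such a detector is largely independent of target size, consistent with the robustness finding reported above.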
Continuous Input. Gaze has been used as a continuous input signal for navigation and control tasks in virtual environments and teleoperation [
11,
118,
154,
233,
271]. Gaze has also been explored in narrative and tourism applications that provide users with information about different objects of interest by either detecting the gaze in a highlighted area of interest or during free exploration [
121,
267].
Overall, past works indicate a variety of applications where eye-only interaction was used. Although comparisons between eye-only interaction and other modalities have not always produced consistent findings, differences in the HMDs and eye trackers utilized and in the interaction tasks can explain some of these inconsistencies. We also observed that the ease of using the dwell-time approach has allowed its adoption across a wide range of research topics, from usability assessment [
74] to increasing users’ accessibility [
138,
256]. However, due to certain limitations that this approach can introduce, such as interaction time delays and user fatigue [
92,
205], we observed an increasing attention to alternative approaches [
63,
85,
111,
129,
247,
266]. The variety of these alternative approaches suggests the potential of eye-based interaction as a flexible interaction mechanism that can accommodate different user capabilities and interaction contexts. However, open questions exist regarding the usability and performance of these approaches compared to each other and to other interaction modalities. Additionally, advances in eye tracking and HMD technologies and in artificial intelligence algorithms hold promise for more streamlined interactions in the future.
3.1.2 Multi-Modal Interaction.
Understanding the capabilities of eye-only interaction is highly valuable, especially for circumstances concerning specific disabilities, where eyes are the only interaction input. However, combining eye-based interactions with other modalities (e.g., head-based and gesture-based interactions) can create a richer and more expressive experience for the user and also better facilitate certain complex tasks. In the following, we describe previous works that focused on combining eye input with different modalities.
Eye and Traditional Input. Devices that interface through mechanical inputs (e.g., button presses), such as cell phones and smart watches, have become ubiquitous, making them ideal modality pairs to bring eye-based interactions to a wider audience. Sidorakis et al. presented a VR user interface combining gaze with an additional mechanical input to signify a selection [
224]. The multi-modal interaction scheme was evaluated as more accurate than traditional mouse/keyboard interaction in an immersive virtual environment. A similar interaction technique was employed in a mobile AR game [
126] and wheelchair navigation [
89]. Ahn and Lee [
3] explored the benefits of combining eye gaze with a control pad attached to a head-mounted display for typing and found this combination to outperform both exclusive eye gaze and exclusive control pad input. Bâce et al. developed ubiGaze, where the interaction is based on gaze tracking and a smartwatch [
12]. The gaze provides selection of real-world objects, and the smartwatch can receive various commands to be executed with regard to the objects. Mardanbegi et al. [
151] combined gaze with a control tool attached to a handheld controller to simultaneously select an object to interact with and a function to apply to it.
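The pattern these systems share is that gaze supplies continuous pointing while the mechanical input supplies a discrete confirmation, which sidesteps the Midas Touch problem because looking alone never triggers an action. A minimal sketch of this pattern, assuming hypothetical eye_tracker, button, and scene interfaces:

```python
def gaze_plus_click(eye_tracker, button, scene):
    """Gaze points, a mechanical button confirms.

    eye_tracker, button, and scene are placeholder interfaces;
    hit_test() maps a gaze point to the object (if any) under it.
    """
    while True:
        gaze_point = eye_tracker.latest_gaze()   # continuous pointing via gaze
        hovered = scene.hit_test(gaze_point)     # object currently under gaze
        scene.highlight(hovered)                 # feedback while hovering
        # Selection requires an explicit mechanical event, not gaze alone.
        if button.was_pressed() and hovered is not None:
            scene.select(hovered)
```

The same skeleton covers controllers, touchpads, and smartwatches; only the source of the confirmation event changes.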
Eye and Speech. Speech-based interaction is one of the primary modalities used to communicate with intelligent characters in futuristic movies. However, due to technological limitations, comparatively little has been done to pair speech with eye-based interactions. Beach et al. developed one of the earlier multi-modal prototypes to provide hands-free interaction by utilizing speech, and discussed the possible use of other modalities, such as blinking or fixating on a desired target, in case speech input is inaccessible [
17].
Eye and Gestures. In many eye-based interactions, using eye input for target selection leaves other modalities, such as the hands, free to be utilized for other interactions such as object manipulation. Heo et al. developed a multi-modal interaction interface for gaming purposes that includes eye, hand gesture, and bio-signal inputs [
81]. In their setup, pointing toward targets of interest was controlled using gaze, gestures were used for selection and manipulation, and bio-signals controlled the difficulty of the game. Pai et al. [
190] combined eye gaze and contractions of arm muscles measured by EMG for subtle selection and interaction. Novak et al. integrated dwell time and intentional movement for VR-based patient rehabilitation [
181]. The system finds the patient’s focus in a VR environment via fixation, and if the rehabilitation robot detects the patient’s intention to move, the robot provides appropriate support.
Other multi-modal interaction approaches include the combination of eye tracking with freehand 3D gestures [
48,
122,
193,
219]. Deng et al. defined the spatial misperception problem that occurs during continuous indirect manipulation with a direct manipulation device [
48], which is observed when combining gaze and gesture input and leads to manipulation errors and user frustration. The authors introduced three methods, all of which improved the manipulation performance of virtual objects. Pfeuffer et al. introduced the
Gaze + Pinch interaction technique for virtual reality. Here, a user’s gaze point is used to indicate the desired object of interaction, whereas pinch gestures are used for its manipulation, thus enabling interaction and manipulation with both near and far objects. This technique addresses the limitation of the virtual hand metaphor, which only allows for near interaction, and, compared to controller-based methods, the user is not required to constantly hold a device.
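A sketch of a single frame of a Gaze + Pinch style update loop may help clarify this division of labor. The hand tracking and scene interfaces below are hypothetical placeholders under our own naming, not the authors' implementation:

```python
def gaze_pinch_update(eye_tracker, hand_tracker, scene, state):
    """One frame of a Gaze + Pinch style loop: gaze indicates the object,
    the pinch gesture grabs and manipulates it, near or far alike.
    All interfaces are illustrative placeholders; positions are 3D vectors.
    """
    if hand_tracker.is_pinching():
        if state.grabbed is None:
            # Pinch onset: acquire whatever object the user is looking at.
            state.grabbed = scene.hit_test(eye_tracker.latest_gaze())
            state.last_hand = hand_tracker.hand_position()
        else:
            # While the pinch is held, relative hand motion drives the
            # object, so manipulation no longer depends on gaze.
            hand = hand_tracker.hand_position()
            state.grabbed.translate(hand - state.last_hand)
            state.last_hand = hand
    else:
        state.grabbed = None  # pinch released: drop the object
```

Decoupling indication (gaze) from manipulation (hand) is what lets the same gesture act on distant objects, avoiding the reach limit of the virtual hand metaphor noted above.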
Eye and Head Rotation. A common approach is to combine eye tracking with head rotation. Techniques have been proposed to allow for hands-free navigation of virtual environments [
187,
192,
202,
222,
260]. Findings indicate that navigation techniques benefit from combining eye tracking with head rotation, since the combination is able to correct for common eye tracking problems, such as calibration drift [
202]. Sidenmark and Gellersen [
222] explored different combinations of eye and head gaze that leverage the synergetic movement of the eyes and head for selection and exploration of an environment. Combination techniques were also found to perform better than eye-only techniques [
122,
202]. Piumsomboon et al. proposed three eye-based interaction techniques for navigation and selection in virtual reality [
197]. Here, they leveraged specific properties of various eye movements. The
Vestibulo-Ocular Reflex (VOR) was, for example, used for a navigation task, whereas an eye-only technique was proposed for selecting targets. These results suggest that different eye-based interaction techniques should not be used competitively, but that specific techniques should serve specific tasks in augmented and virtual environments. Mardanbegi et al. also proposed using the VOR to improve selection. In their work, however, the VOR was explored in the context of 3D gaze estimation, in particular in comparison to vergence; their VOR-based depth estimation showed similar performance in several scenarios despite requiring only one tracked eye [
150].
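A recurring pattern in these eye-head combinations is to use fast but noisy gaze for coarse pointing and slower head motion for refinement, e.g., to compensate for the calibration drift mentioned above. The following sketch illustrates that general idea; the gain value, the 2D projection of head orientation, and the saccade reset rule are illustrative assumptions rather than any single surveyed technique:

```python
import numpy as np

HEAD_GAIN = 0.2        # illustrative: head motion nudges the cursor gently
SACCADE_RESET_PX = 50  # illustrative: large gaze jumps discard the refinement

def refined_cursor(gaze_point, prev_gaze_point, head_dir, prev_head_dir, offset):
    """Combine coarse gaze pointing with fine head-based refinement.

    gaze_point / prev_gaze_point: 2D gaze samples (fast but noisy or offset).
    head_dir / prev_head_dir: 2D projections of the head orientation.
    offset: accumulated head refinement carried across frames.
    Returns (refined cursor position, updated offset) as shape-(2,) arrays.
    """
    if np.linalg.norm(gaze_point - prev_gaze_point) > SACCADE_RESET_PX:
        offset = np.zeros(2)  # a saccade moved the coarse cursor: start fresh
    # Frame-to-frame head rotation becomes a small cursor correction, letting
    # the user deliberately compensate for eye tracker inaccuracy.
    offset = offset + HEAD_GAIN * (head_dir - prev_head_dir)
    return gaze_point + offset, offset
```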
Eye and BCI. Utilizing
brain–computer interfaces (BCIs) in a hybrid form (e.g., Eye + BCI) can increase the performance of the whole system [
194]. Ma et al. combined a brain–computer interface with eye tracking for typing in virtual reality [
147,
269]. A similar setup has been applied in 3D object manipulation [
40] and a horizontal scrolling and selection interface [
156]. Putze et al. [
200] combined eye tracking and steady-state visually evoked potentials (SSVEP) to improve the robustness of target selection.
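Such hybrids can be framed as fusing per-target evidence from both channels. The sketch below shows one simple weighted fusion; the score definitions, weight, and threshold are illustrative assumptions and are not taken from [194] or [200]:

```python
def fuse_gaze_and_bci(dwell_scores, bci_scores, gaze_weight=0.5, threshold=0.7):
    """Fuse normalized per-target evidence from gaze (e.g., fraction of the
    last window spent dwelling on each target) with per-target probabilities
    from a BCI classifier (e.g., SSVEP). Returns the winning target id, or
    None if the combined evidence stays below the decision threshold.
    """
    fused = {
        target: gaze_weight * dwell_scores.get(target, 0.0)
                + (1.0 - gaze_weight) * bci_scores.get(target, 0.0)
        for target in set(dwell_scores) | set(bci_scores)
    }
    best = max(fused, key=fused.get, default=None)
    if best is not None and fused[best] >= threshold:
        return best
    return None
```

Requiring agreement from two independent channels is what makes such hybrid selection more robust than either channel alone.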
Overall, we observed a wide range of modalities paired with eye-based input across various applications, such as accessibility, health care, and entertainment. Some modalities, including traditional input [
3,
12,
89,
126,
151,
224], head rotation [
150,
187,
192,
197,
202,
222,
260], and gestures [
48,
81,
122,
181,
190,
193,
219], were more commonly investigated. This can be explained by the fact that some of these modalities are more well-established (i.e., traditional input), and in some cases, others are already paired in one device or have dedicated resources for pairing, for instance, HMDs with eye trackers like FOVE and HP Omnicept, or eye tracking and hand tracking add-ons like Pupil Labs and Leap Motion. Separately, advances in natural language processing and the ubiquity of the speech modality, evident from the popularity of digital home assistants such as Amazon Alexa and Google Home, hold promise for more research on combining speech and eye input, as we identified only one such example in our review [
17]. When considering eye gaze for interaction in VR, we should not forget the impact of head and torso movement, particularly as VR is increasingly moving toward fully tracked free movement. Sidenmark and Gellersen [
221] recently explored the coordination between the eyes, head, and torso when looking at targets in VR. Their findings gave insights into the coordination of these body parts and highlighted that, when designing gaze-based interfaces, these modalities should be considered as a whole and not separately.