As a first step towards introducing flicker for use in AR glasses, an understanding of how it compares to current approaches is required. Prior explorations of flicker have focused on generating candidate techniques; comparisons against other approaches are limited, and comparisons against saliency-based guidance are particularly lacking. Further, AR glasses introduce constraints on the manipulations possible due to device capabilities. Therefore, we wanted to evaluate techniques as they can actually be produced in AR glasses; earlier work has used either screen-based content or VR views.
Controlling for confounds, such as the position from which the user views the scene and any variance in that position due to motion, enables a high degree of internal validity. This suits the comparison between techniques we wanted to conduct, at the cost of some external validity, since the wider task and user actions will influence effectiveness in practice.
3.1 Study Outline
Conditions: We chose techniques that represent different characteristics within a conceptual design space. We settled on the following techniques: visual guidance using geometrical cues, visual guidance using inherent saliency modulation, and visual guidance using a flicker cue. Figure 1 gives an overview of the techniques implemented for our study and the effect of varying overtness.
For a technique using geometric cues, we chose outlining [35], in particular the variant using halos/circles [45, 51]. This provides an easily implementable method that has been demonstrated to work in OSTHMDs and is similar to several other methods proposed in the literature [23], including frame effects [40]. This technique also avoids known problems that arise when using arrows [40] while also reducing the occlusions caused by using arrows and dots as geometrical primitives. In our implementation, we utilised a white circle encompassing the target area. To vary the overtness of the geometry, we adjusted the opacity of the circle.
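To illustrate how opacity controls the overtness of such a circle, the following minimal Python sketch composites a white ring onto a scene luminance value. It is not the study implementation: the function names, the ring thickness, and the simple additive display model (an optical see-through display can add light but not remove it) are assumptions made for illustration.

```python
import math


def ring_alpha(x, y, cx, cy, radius, thickness, opacity):
    """Opacity of the ring overlay at pixel (x, y); 0.0 away from the ring."""
    distance = math.hypot(x - cx, y - cy)
    return opacity if abs(distance - radius) <= thickness / 2 else 0.0


def composite_additive(scene, alpha, overlay=1.0):
    """Hypothetical additive display model with luminance values in [0, 1].

    The overlay is added on top of the scene and clipped at 1.0, so it
    cannot darken what lies behind it.
    """
    return min(1.0, scene + alpha * overlay)
```

Under this model, the same opacity adds less visible contrast over a bright background than over a dark one, which is one way to read the contrast constraints on outlines in see-through displays.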
For a technique demonstrating visual guidance using saliency, we used the recent technique by Sutton et al. [45] as it was, to the best of our knowledge, the only saliency-based guidance technique explored for OSTHMDs and Augmented Reality. The technique modulates contrast and saturation, increasing them in a target area while reducing them everywhere else. We adjusted the technique by using the direct outlines of the target areas rather than a blurred circle to reduce the geometrical impact of the technique. The original paper explored the parameter space to adjust the overtness of the saliency modulation. We did not apply a per-component-based optimisation but adjusted the levels of each component uniformly to change the overall overtness.
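The uniform adjustment described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function names, the Rec. 709 luma weights, the mid-grey contrast pivot, and the `max_change` gain are assumptions made for illustration, and the sketch ignores the additive constraints of an OSTHMD.

```python
def luma(rgb):
    """Rec. 709 luma of an (r, g, b) triple with components in [0, 1]."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def clamp01(value):
    return max(0.0, min(1.0, value))


def modulate_pixel(rgb, gain):
    """Scale saturation and contrast of one pixel by the same gain.

    gain > 1 boosts both components, gain < 1 reduces them.
    """
    grey = luma(rgb)
    out = []
    for c in rgb:
        c = grey + gain * (c - grey)  # saturation: distance from grey
        c = 0.5 + gain * (c - 0.5)    # contrast: distance from mid-grey
        out.append(clamp01(c))
    return tuple(out)


def modulate_image(image, mask, level, max_change=0.5):
    """Boost pixels inside the target outline and reduce all others.

    image: rows of (r, g, b) triples in [0, 1]
    mask:  rows of booleans, True inside the target outline
    level: overtness in [0, 1], e.g. 0.25, 0.5, 0.75, 1.0
    max_change: hypothetical gain offset applied at level 1.0
    """
    boost = 1.0 + level * max_change
    cut = 1.0 - level * max_change
    return [
        [modulate_pixel(px, boost if inside else cut)
         for px, inside in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]
```

Because a single `level` drives both components, the sketch mirrors the uniform adjustment rather than a per-component optimisation.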
For a flicker technique, we were inspired by screen-based information visualisations [51]. They are well documented and, unlike other approaches used in perceptual studies [4, 29], do not require eye tracking. Similar to the original implementation for standard displays, our implementation does not rely on achieving high-frequency flicker at the critical flicker frequency (CFF), as this usually cannot be achieved with current head-mounted displays, nor is the CFF consistent for all users and viewing environments. Instead, our technique briefly shows a high-frequency flicker before transitioning to a low-frequency, low-intensity flicker. Besides implementing it for use in HMDs, we modified the technique by adjusting the shape of the luminance adjustment (the area of flicker) to match the shape of the target. To vary the overtness of the technique, we adjusted the time spent at the various flicker frequencies.
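The two-phase flicker described above can be sketched as a time-dependent luminance gain. The sketch below is illustrative only: the frequencies, amplitudes, the 5-second duration, and the linear mapping from overtness to the length of the fast phase are all hypothetical values, not the parameters used in the study.

```python
import math


def fast_phase_duration(overtness, total=5.0):
    """Seconds of high-frequency flicker for an overtness level in [0, 1].

    Hypothetical linear mapping: 0.25 gives no fast phase and 1.0 keeps
    the flicker fast for the whole presentation.
    """
    fraction = (overtness - 0.25) / 0.75
    return total * max(0.0, min(1.0, fraction))


def flicker_gain(t, overtness, total=5.0,
                 fast_hz=15.0, slow_hz=2.0,
                 fast_amp=1.0, slow_amp=0.3):
    """Luminance added to the target area at time t (seconds), in [0, 1]."""
    if t < fast_phase_duration(overtness, total):
        freq, amp = fast_hz, fast_amp
    else:
        freq, amp = slow_hz, slow_amp
    # A raised sine keeps the value non-negative, since an additive
    # display can brighten the scene but not darken it.
    return amp * 0.5 * (1.0 + math.sin(2.0 * math.pi * freq * t))
```

In a renderer, the returned gain would be applied only to pixels inside the target shape, so the flickering area matches the shape of the target.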
Task: The task for this study was to view a series of images. Participants were informed that we record gaze data of people viewing a set of images, some of which had been modified, and encouraged to explore the images. They were also made aware that we would be asking questions about the images afterwards.
Design: We designed a within-subjects study to investigate and compare the effectiveness of guidance techniques at different levels of overtness. We evaluated the effectiveness of the techniques using a set of real images which participants were asked to look at whilst their gaze was recorded. Our independent variable was the method of guidance provided (None, Geometric, Saliency, Flicker). We collected results for each method at four different levels of modulation overtness (25%, 50%, 75%, 100%) to evaluate their relative effectiveness at different levels. Examples of each technique applied at each level can be seen in Figure 1. We looked at the time to first fixation and the area of the image explored as the dependent variables. Subjective overtness noted by the participants was recorded on a seven-point semantically anchored scale.
Apparatus: As a primary goal of this study was to explore the use of cues in AR glasses, we required users to see the guidance provided as seen through the glasses. The limited luminance range, non-linear gamut, and additive nature of these displays will affect the perceived flicker. Similarly, the saliency modulations producible are constrained to additions to the scene only, and contrast constraints will impact the visibility of outlines. However, collecting reliable gaze data in AR displays is challenging as the quality of eye trackers varies, and data access is limited when compared to eye trackers traditionally used in research. As we faced similar challenges and were interested in reliable results (internal validity), this study used a study apparatus initially proposed by Sutton et al. [45]. The key idea is to capture the view through an OSTHMD with a camera, in our case a Sony A7M3, and present this view in a VR display with an integrated eye tracker (HTC Vive Pro Eye, see Figure 2).
The image dataset presented to the participants consisted of 80 images selected to represent a range of real-world scenarios in which visual guidance may be implemented. We included both natural scenes and man-made structures. We split the images into three groups based on their image saliency (High, Medium, and Low) as given by a commonly used saliency estimation predictor [8]. We then assigned each image to a desired level of modulation from 1 (minimal modulation) to 4 (maximum modulation). This resulted in a dataset of 80 images divided into four sets of 20, with an even distribution of inherent saliency spread across various real-world scenarios.
Based on the generated saliency map for each image, we also selected one object or area to be modulated. These objects were selected as places that were expected to receive little, but some, attention from viewers in the unmodulated condition. Areas were selected based on the expected saliency, and the objects contained were random. Therefore, top-down processing may have introduced some variance into the degrees of attention applied, particularly under the baseline condition that would reflect natural viewing. All images had higher saliency areas which were expected to initially draw attention in the limited viewing time, and we used a counter-balanced design to mitigate any effects. We chose not to use videos in our dataset as they would introduce additional confounding factors due to motion in videos, which serves as an additional salient cue.
Procedure: Participants first signed a consent form and completed a demographic survey (age, gender, if they were colourblind, any other uncorrected visual impairments). Then they put on the headset. We ran the eye tracker calibration, which was verified to be within 1°. If the error exceeded this threshold, the calibration routine was rerun. Once the calibration was verified, the participants were shown a white cross at the centre of the virtual screen and instructed to focus on it. After 3 seconds, the cross was taken away, and the participants were shown an image for 5 seconds. This was followed by a black screen with a question regarding the perceived obtrusiveness of the modulation, using a seven-point semantically anchored scale with labels of 1: Very Subtle and 7: Very Overt. They were also given option 0: No modulation. After participants answered the question, the cross was shown again. This procedure was repeated for all images in the dataset, with each image being modulated by one of the techniques at a given level. The level of modulation being applied to the images was set using double Latin squares to compensate for ordering effects. The image and technique order within each level were randomised. After viewing all images, the participants were given a break and a chance to remove the headset before continuing the study (after calibrating the system again). The participants were then shown one of three unmodified images (either low, medium, or high saliency distribution) and all modified versions of the image at one modulation level and asked for any further comments regarding the techniques. Next, this was repeated for all levels of modulation. This study was approved by the institutional ethics committee.
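The exact construction of the Latin squares is not detailed in the text. As one common way to counterbalance four levels, the sketch below builds a balanced (Williams) Latin square, in which every level appears once in each position and follows every other level exactly once. The function names and the mapping from participant index to row are hypothetical.

```python
LEVELS = [25, 50, 75, 100]  # modulation levels in percent


def balanced_latin_square(n):
    """Williams design for an even number of conditions n.

    Returns n orders, each a list of condition indices 0..n-1.
    """
    if n % 2:
        raise ValueError("this construction needs an even number of conditions")
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    return [[(c + shift) % n for c in first] for shift in range(n)]


def level_order(participant):
    """Order of modulation levels for a zero-based participant index."""
    square = balanced_latin_square(len(LEVELS))
    row = square[participant % len(square)]
    return [LEVELS[i] for i in row]
```

Cycling through the rows by participant index keeps the orders evenly used whenever the number of participants is a multiple of four.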
Participants: We recruited 28 participants from around the campus (7 female, 21 male, mean age: 24.5, sd: 6.2). All participants could calibrate the eye-tracker sufficiently and took part in the interview.
3.2 Analysis
Analysis: For the statistical analysis, we used a significance level of p < .05. We calculated fixations using the I-VT fixation detection algorithm [31]. To determine if a fixation was within the target area, we tested if any fixation point lay within the area denoted by the outline, allowing for an error of 1 degree. With this we determined if a user fixated (F) within the target area for each image and, if so, the time to first fixation (TtFF). If the participant did not look at the target area, we set TtFF to the maximum time (3 sec). This assumes that, for all images and techniques, the participant would have looked at the target area immediately after the time shown. Whilst this is a false assumption, it is equally applied across all conditions and allows us to run statistical tests on a complete data set. This is a very conservative approach as we would expect actual values for fixations to show greater variance and actual p-values to be smaller than those found. We also analysed the area of exploration (AoE) and the duration of time (D) spent fixated on the target area. Examples of the area of images explored by participants can be seen in Figure 3.
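The fixation and TtFF computation described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study: gaze samples are assumed to be given in degrees of visual angle, the 30 deg/s velocity threshold is an assumed default, and the target is simplified to a circle instead of the actual target outline.

```python
import math


def ivt_fixations(samples, velocity_threshold=30.0):
    """Velocity-threshold fixation detection.

    samples: list of (t_seconds, x_deg, y_deg), sorted by time.
    Returns a list of (start_t, end_t, centroid_x, centroid_y).
    """
    fixations, group = [], []

    def flush():
        if len(group) >= 2:
            xs = [s[1] for s in group]
            ys = [s[2] for s in group]
            fixations.append((group[0][0], group[-1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
        group.clear()

    for prev, cur in zip(samples, samples[1:]):
        dt = cur[0] - prev[0]
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        velocity = math.hypot(cur[1] - prev[1], cur[2] - prev[2]) / dt
        if velocity < velocity_threshold:
            if not group:
                group.append(prev)
            group.append(cur)
        else:
            flush()  # a fast movement ends the current fixation
    flush()
    return fixations


def time_to_first_fixation(fixations, target, tolerance_deg=1.0, ceiling=3.0):
    """Return (fixated, ttff) for a circular target (cx, cy, radius_deg).

    If no fixation lands on the target, ttff is set to the ceiling value.
    """
    cx, cy, radius = target
    for start, _end, fx, fy in fixations:
        if math.hypot(fx - cx, fy - cy) <= radius + tolerance_deg:
            return True, start
    return False, ceiling
```

Assigning the ceiling to trials without a target fixation keeps the data set complete for paired tests, at the cost of understating the true differences between conditions.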
We applied Friedman’s test (degrees of freedom: 3) and Wilcoxon’s paired test (degrees of freedom: 27) with Holm-Bonferroni correction to all measures with non-parametric data. In the following, we report the statistical results relevant to our main research goals. Further details are provided in the appendix.
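As an illustration of the correction step, the sketch below computes Holm-Bonferroni adjusted p-values for a family of pairwise comparisons; four conditions would give six pairwise tests per level. The function name and the example p-values are hypothetical and are not taken from the study.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and results are
    capped at 1.0.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four pairwise tests.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in holm_bonferroni(raw)])  # [0.03, 0.06, 0.06, 0.02]
```

A comparison is then reported as significant when its adjusted p-value falls below the chosen significance level.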
Results. Looking at the TtFF (Figure 4 Left), we can see that above 25% flicker enabled the fastest fixations, except when compared to geometric cues at 75%. At 25%, flicker was only significantly different from (and slower than) geometric guidance (p < .0001), which was also significantly faster than none (p < .0001) and saliency (p < .0001). Above 25% we see that flicker provides a significantly faster time to first fixation (all p < .0001), indicating that it was able to provide effective guidance in this regard. In fact, a significant effect for all techniques was found when compared to the baseline condition, indicating an improvement in effective guidance. Looking at the guidance techniques over 25%, compared to saliency-based guidance, we see that flicker is consistently significantly faster to draw gaze (all p < .0001). Compared to geometric guidance, flicker was also able to draw fixations significantly faster at 50% (p = .0022) and 100% (p = .00355), but not at 75% (p = .074). Overall, we see that once the initial high-frequency component was present, flicker proved effective and the fastest of the tested techniques. The mean values and standard deviations for each technique at each level can be seen in Figure 5 (Right).
Looking at the AoE (Figure 4 Right), we can see that there is a similar trend to TtFF, with flicker being the most limiting on AoE above 25%. Geometric guidance was the only technique to show a significant effect compared to the baseline None (p = .0048) and subsequently also flicker (p = .0048) and saliency (p = .0048). Again, we see a significant effect for flicker compared to the alternative techniques. In this case, in all instances above 25% flicker was significantly different, leading to an overall decrease in image exploration among participants (all p < .0001). This indicates that once gaze is drawn to a target by flicker, the amount of continued exploration decreases compared to the alternatives tested.
Looking at the overtness (Figure 5 Left) of the conditions, we see that flicker was considered the most overt in all conditions above 25% (None: all p < .001; Saliency: 50% p = .00024, 75% p = .00075, 100% p = .03435; Geometric: 50% p = .00636, 75% p = .00355) except when compared to Geometric guidance at 100% (p = .12407). This generally shows that flicker provides a very noticeable effect compared to the other conditions.
We allowed participants to note if they did not perceive any modulation (modulated noted) and tested this. We also tested whether or not participants fixated on the target area based on previous work [45], and the duration of the fixations on the target areas. The results for these can be seen in Figure 6. The noticeability of modulations was significantly different across all conditions and levels of modulation, in line with the differences in subjective overtness. Fixations followed the same pattern of significance as TtFF, save for flicker compared to geometric at 100%, where neither was significantly better than the other (p = .0726). The duration of fixations followed the same pattern as AoE, with flicker producing significantly longer fixation durations at levels above 25%.
Key points raised in Interviews. The participants raised several interesting points when discussing the techniques at different modulation levels. First, participants liked the ability of flicker to provide precise guidance without obscuring the shape of a target but found the highest overtness level of flicker too intense. Participants noted the difference between the highest level of flicker (100%) and the second highest (75%), at which the flicker switched from constantly fast to a slower frequency. Second, a third to a half of the participants commented negatively on saliency modulation. While participants appreciated the reduced obtrusiveness of using saliency to provide guidance, higher levels increasingly washed out the image, which was perceived as an undesired filter. The effectiveness of the alternatives was given as a reason for saliency not being preferred. Third, geometric outlines were often preferred, although not always considered effective. Initially, the outline was disliked due to the effect being too subtle, although at higher levels, issues arose with it standing out too much from the image. It was also noted that the outline made it difficult to understand what was being highlighted.
Generally, the anecdotal feedback showed a preference for effective and clear techniques to provide guidance. However, this comes with the downside of reducing the overall viewing experience. This led to a dislike for saliency and to concerns about overly overt outlines and extended periods of high-frequency flicker.
3.3 Discussion
One immediate thing to note from our results is that at the 25% level, only the geometric modulation had any perceivable effect, and it was the only technique to affect participants’ gaze patterns. It is also the only modulation technique that was noticed, though the noticeability of modulations even in the None condition was rated at 50% on average, indicating many false positives. This might be explained by the fact that even with no modulation (None condition) the participants had an OSTHMD in their optical path, which could have caused some effects not perceived as modulation and should be considered when interpreting the other results.
While all techniques except geometric modulation appear to be neither effective nor even perceivable at low overtness levels, they showed a similar ability to provide effective guidance compared to baseline at all other levels. With respect to the flicker modulation, we can see that in all levels where there is an initial fixation component to the flicker effect (levels 50%, 75% and 100%), it provides the fastest TtFF and greatest F, outperforming traditional geometric cues.
Whilst being effective, flicker showed a higher tendency to cause attention tunnelling when compared to the geometric technique, a concern that needs to be considered when applying it in real-world scenarios. To avoid this, saliency modulation appears to be the best option, minimising the time the target is fixated to that necessary to identify it and directing attention faster and more consistently than the unmodulated condition. However, saliency modulation did not achieve the speed or high chance of fixation that the other techniques achieved. When looking at saliency, we can again see that precise calibration is needed. Without using calibrations of the parameters, lower modulation levels showed no significant effect with either no differences noted or a subtle shift that did not make a clear target stand out, whilst at higher levels the participants found the washing out of colours undesirable. Based on the participants’ comments, tuning saliency techniques towards directly increasing the saliency of the target, with limited reductions in the surrounding environment, may be preferable.
We also found that once a drawing time was introduced into the flicker effect, the effect was perceived as overt and participants thought it could become annoying. Turning off the effect before it can be viewed, as demonstrated in other works [4, 29], is a potential means to alleviate this. However, this may impact the ability of users to confirm the target of guidance, as these works have assumed a need for subtle effects that are not directly viewed, and only evaluated overtness by varying the size of modulated areas [29].
One aspect to consider from our results is the need for tailoring algorithms to both user and context. Whilst there is also the confounding variable of interpretation of the question, we can see that geometric guidance was often preferred in the subjective interview. However, this was not always the case and for some participants, it was not even the optimal guidance. We believe that tailoring modulations to the needs of an individual user will be an important step forward in the development of further methods for visual guidance. Furthermore, the need for context-based modulations when applying saliency and geometric cues is apparent. We can clearly see in some images where the generic application of modulations can have little to no effect, for example, those where the target is a light area surrounded by further light areas, and those where the effect is quickly evident, for example, targeting a dark area surrounded by further dark areas, causing modulations to create a quick transition (Figure 7). This need for context-aware and adaptive highlighting techniques was already reported elsewhere [41], and our results indicate that flicker is another technique to add to the existing repertoire of techniques for guidance that can be applied in-context.
We also see that flicker and geometric outlining techniques both provide an effective means to quickly draw attention to a target area and hold attention there. Notably, flicker appears to be the most effective at this. Saliency was still able to effectively draw attention when compared to no modulation but fixations were generally slower and saliency could not maintain attention as well as the alternatives. This would indicate that saliency techniques may indeed be best left to utilisation in their currently indicated application areas of subtle, less obtrusive, and more scene-preserving methods of visual guidance.
The results from the study indicate the potential of the flicker technique for visual guidance to provide effective forms of visual assistance in AR glasses. However, results also indicate that modulating flicker is needed to prevent it from being overt and to reduce the attention tunnelling seen in geometric cues.