1 Introduction
Virtual reality (VR) is increasingly being used to teach people new skills in the workplace by immersing them in realistic training environments [129]. Advances in commercial head-mounted displays (HMDs) make VR a convenient and cost-effective alternative where training would otherwise be high-risk, dangerous, or expensive [37]. Training in VR can be more enjoyable than its real-world counterpart and has been shown to be as effective [62], and in some cases more effective, for learning [51]. As a result, VR training is being deployed across a wide range of industries including healthcare, nuclear, transportation, and aerospace [37], and analysts have predicted VR training has the potential to boost global GDP by $294.2 billion by 2030 [98]. The overwhelming majority (≈78%) of industrial VR training use cases involve procedural and psychomotor skills [101], i.e., skilled movements that require coordinated motor action and cognition [107]. The most common skills are manual tasks that involve learning a procedure or sequence of actions requiring the user to grasp and manipulate objects, such as construction [2, 4, 11, 88], dental [84] and surgical [12, 46] procedures, equipment operation [31, 83, 99, 117], and tool use [92].
One of the advantages of VR training is that it offers a platform for ‘hands-on’ learning. As a result, experiential and constructivist learning theories are the most commonly implemented frameworks when developing VR training [101], and the predominant way to teach skills in VR is to utilise active learning approaches, i.e., ‘learning-by-doing’. These approaches guide an individual to complete each step of the procedure while learning the skill through rehearsal [2, 12, 16, 31, 84, 95]. Physical practice is essential to psychomotor learning, which occurs in three stages: cognitive, which involves remembering and understanding the required skill; associative, where the skills are refined through rehearsal; and finally autonomous, where the skill is automatically replicated with maximum efficiency and minimal conscious effort [29]. Research has shown that virtual practice can prepare individuals in the same way as real-world practice [82] and that active learning in VR can be highly effective compared to traditional real-world training, especially when learning a manual assembly task [2, 3, 35, 51, 62].
Despite the emphasis on ‘learning by doing’, there is evidence to suggest that in the early cognitive stages of learning [29], observing a demonstration is beneficial [6, 26]. This is thought to be especially true when the demonstration involves a human model because it leverages an innate bias to ‘learn by watching’ others, which is less cognitively demanding [26, 73, 90]. Observational learning (‘learning by watching’) is an effective and commonly used approach for acquiring psychomotor skills outside of VR [18, 40, 45, 59, 64, 102, 118] and has been shown to provide equivalent outcomes to active learning whilst offering potential benefits of time efficiency [34, 70, 131] and reduced cognitive load [119, 126]. In particular, combining observational learning and physical practice is thought to be one of the most efficient and effective approaches to real-world training [6, 59, 102]. VR training accompanied by observation of a real-world demonstration can be more effective for learning manual tasks than VR training alone [116]; however, it is not always practical to watch a real-world demonstration, and it is unclear whether observation can be effectively integrated in VR.
Observational learning theories predict that model similarity impacts learning effectiveness [10, 23, 24]. One theory is that the action-observation network and overlapping mirror neuron system, which enables us to learn by imitating others, fires more as the similarity between ourselves and the model we are watching increases [7, 20, 23, 67]. The highest degree of model similarity is achieved using a self-model, and prior work has demonstrated that video self-modelling can be used to enhance manual fine psychomotor skills (e.g. in a cup-stacking task [43] or playing a video game [60]). Existing research into whether this phenomenon occurs using avatars has shown that learning is more effective when the avatar demonstrator is made to look increasingly similar to the user, either by matching just the skin tone, hair colour, and gender [27] or by using a photorealistic avatar [33]. However, this has only been shown in the context of gross psychomotor skills, and understanding whether self-model avatars have the same advantages as video self-models for fine psychomotor learning is an important design consideration for a wide range of VR training applications that could utilise the observational learning paradigm.
While both active and observational learning approaches have merits, the current VR literature is dominated by active learning approaches, and it is not well understood how ‘learning by watching’ an avatar might compare. While comparisons between active VR training and alternative real-world training approaches (e.g. instruction manuals, videos, or AR) are common [3, 34, 35, 51, 82, 89, 100], there are few comparisons of different learning approaches within VR training [57]. Additionally, it is difficult to retrospectively compare different learning approaches from the VR training literature because very few applications expose the learning theory or approach that guided their implementation [101].
To effectively compare different learning approaches in VR, it is important to assess whether skill transfer to the real world has taken place [14, 69, 82, 92]. Skill transfer in VR is commonly assessed using ‘near’ transfer tasks, where the real-world task is identical to the one experienced in VR [2, 82, 125]; however, ‘far’ transfer tasks, where the taught skills are applied in a dissimilar real-world context [39, 78], are underexplored.
We aim to address these research gaps relating to the relative efficacy of observational learning in VR and the role of avatar similarity in learning manual psychomotor skills by answering the following research questions:
RQ1
How does active learning compare to observational learning of a fine psychomotor task in VR?
RQ2
How does active learning compare to observational learning when transferring skills to the real world?
RQ3
How does demonstrator similarity affect observational learning?
We begin by presenting an interview study (n=22) with a range of industry stakeholders who have experience designing, developing, delivering, or using VR training. This provides supporting evidence about the prevalence of active learning and the importance of skill transfer in industry VR training applications, which complements prior reviews of academic literature [37, 101, 129], and leads to design considerations for how to approach VR training in our study. To address our research questions, we conducted a between-subject user study (n=102) over three sessions, which compared the effectiveness of active and observational learning of an assembly task in VR. To evaluate and compare active and observational learning and their applicability to VR training in industry, we used a retention task in VR (RQ1), as well as near and far transfer tasks in the real world (RQ2). These tasks were conducted immediately after training, and after a 10-14 day delay to understand if and how learning decayed over time. To explore RQ3, we compared different demonstrator representations for the observational learning condition: fully customised realistic avatars of the user, minimally customised avatars, and dissimilar avatars.
This paper provides the first evidence to show that observational learning can be effective for learning a fine psychomotor task in VR when combined with ‘hands-on’ practice, and can lead to better far transfer than active learning to more difficult tasks where distractions are present. This has important implications for how VR training should be delivered, because there are very few use cases where a real-world task does not involve any variation, and it highlights the methodological importance of including far transfer tasks when assessing VR training approaches. We replicated the poor retention of learned skills after a prolonged period shown in previous research [82], confirming the importance of when VR training is delivered. In contrast to prior work on learning of gross psychomotor skills, our analysis reveals that observing a self-avatar does not improve learning and can be distracting due to the novelty effect of the user seeing themselves virtually. This is important to consider because most learners who participate in VR training are unlikely to have experienced VR or their own self-avatars. In summary, we contribute empirical evidence showing the following:
(1)
The prevalence in industry of active learning in VR.
(2)
The effectiveness of observational learning in VR when coupled with practice.
(3)
The superiority of observational learning over active learning for far transfer effects.
(4)
Increasing avatar similarity does not improve learning for fine psychomotor skills.
(5)
Learned fine psychomotor skills decay quickly without further training.
3 Industry Interviews
Fine psychomotor skills, such as assembly tasks, are among the most commonly evaluated in the VR training literature [1]. Prior work has emphasised the importance of these skills for industrial use cases, and the number of VR training applications for training workers continues to grow [37, 101, 129]. However, the instructional strategies and how they are implemented in VR training are rarely discussed [1]. To provide additional evidence for the use of VR training in this context and to uncover how this type of training is typically structured, semi-structured interviews were conducted with industry stakeholders. We use the findings to motivate the methodology and inform the development of the VR training employed in our study.
3.1 Participants
We recruited participants who have experience designing, developing, delivering, or using workplace VR training via adverts posted on LinkedIn and Twitter. We interviewed 22 individuals (CEOs, Founders, Consultants, Developers, Producers, Managers, Directors, Vice Presidents) from a range of companies (Consulting, R&D, Marketing, Enterprise VR), services (Policing, Fire), and institutes.
3.2 Procedure
Anyone with relevant experience who expressed interest was given an information sheet to read before signing up to participate. Each semi-structured interview was conducted via video call and lasted approximately 20-30 minutes. The interviewer reminded participants of the information sheet before gaining consent for the interview to be recorded. The interviews were focused around four main questions, first exploring what tasks are being simulated and taught using VR in industrial contexts (“What do you/the company you work for use VR training for?”, “What types of tasks? Could you walk me through an example?”). After establishing the types of tasks that are trained using VR, the interviewer used the given examples of tasks containing psychomotor elements to probe how this training was approached, what actions the learner performs in the virtual environment, and how they interact in VR (“Within the examples of VR training you are familiar with, what kinds of actions are involved? By this I mean how does the user interact? What types of actions do they have to perform during the training?”). Finally, the goals of the training and how/whether these are assessed were discussed, with a particular focus on the metrics used. At the end of the interview, the experimenter thanked the participant for their time and offered them the opportunity to ask any follow-up questions they might have. All interviews were recorded, auto-transcribed, and corrected. Participants were allowed to share their screens and utilise the chat function to share resources, videos, and images to demonstrate the virtual training that they were referring to.
3.3 Summary of Findings
A reflexive thematic analysis was used to analyse the interview transcripts and generate overarching themes. A data-driven inductive coding process was used to identify a number of codes that were grouped into sub-themes under the overarching themes [13].
Active Learning is the Dominant Approach. Training of industrial procedures that fall under the psychomotor domain was unanimously achieved through ‘learning-by-doing’, i.e. active learning. The learner would gain hands-on experience of doing the procedure (“it’s completely active learning right, it’s completely learning by doing”). Within the VR environment, this would involve the learner having a first-person perspective (“so all these use cases, you are first person, you’re immersed in the experience”), so that they were in a position to do the tasks themselves (“gives you the ability to interact and do the task yourself, which actually enhances the training quotient of VR training”) and physically practise, allowing them to gain muscle memory for the procedure (“get a feel of actually carrying out these activities themselves, which you know becomes a muscle memory”).
Users are Guided to Perform the Correct Actions. The learner is guided to complete the actions in the correct order (“it pretty much told them exactly what to do”; “putting someone in a room and taking them through a sequence of tasks”). This is usually achieved using a mixture of visual (“highlight the object”; “it will flash this kind of guided-mode like rotate the engine block to 180”), audio (“voice-based guidance”), and text cues (“usually a text box”). Sometimes a demonstration is incorporated where the learner observes how to complete the procedure before having any hands-on experience (“Some client requires they would like to have it through a third person-based demonstration”). This could be in the form of a virtual tutor (“when they put on the helmet there can be a virtual trainer that teach them how to perform the procedures safely”) or a ‘ghost’ trainer (“within that VR environment have like a ghost trainer sort of thing who’s doing it, and you have to follow them so they make steps and you have to go alongside and follow”).
Users Practise their Skills Virtually. In a separate practice mode, the learner receives far fewer guidance cues, or none at all, so that they can attempt the procedure from memory (“they just have to remember what they have to do, I think that is quite useful as a progression to actually train someone”). If the learner makes a mistake, they receive feedback notifying them of their error (“the screen goes red so you know you’re making a mistake”) and helping them to learn the correct procedure (“we always provide immediate feedback, because the goal is to teach the user the procedures”).
Knowledge Retention and Transfer are Important and Should be Measured. During an assessment mode, the learner’s actions are monitored to give an indication of their proficiency in completing the procedure (“monitor the tasks of the user”; “If you took all the steps it’s checked, if the exercise is completed successfully or with errors”). Performance measures collected whilst the learner completes the task include the time taken, the number of attempts, and errors (“did they do it effectively…in the right sequence, in the right time”; “tracking their number of attempts”). Learners are often observed or recorded during this mode so that their performance can be reviewed later. Retention of the steps can also be assessed in the form of a reassessment after completing all training modes (“an individualised assessment after the training to the user”).
Transfer activities are also utilised in some instances to further measure a learner’s understanding of a skill. This typically involves the learner completing a separate task related to the one they have been learning but with slight differences, so that they must demonstrate and use their understanding rather than repeat the exact memorised procedure (“There was a lot more, there’s more ambiguity, so it wasn’t necessarily like an SOP [Standard Operating Procedure]…they would follow, they would have to actually solve a problem”). This gears learners towards transferring their skills to the real world, where scenarios are less rigid and there may be unknown factors (“even though they know how to operate a piece of equipment, there’s always going to be a level of ambiguity. That was a really important piece for them actually transferring it to the real world”), requiring the use of more open psychomotor skills.
3.4 Design Considerations
Our findings show active training is very common in industrial VR training applications. In contrast, observing a demonstrator from a third-person perspective is rare and was only mentioned by one of the stakeholders. Based on the interviews, we draw several conclusions that inform the design of the VR training developed for this study:
DC1
Active training should provide a first-person perspective which allows the user to directly interact with the objects themselves.
DC2
VR training should have three separate phases: Training, Practice, and Assessment.
DC3
In the training phase, users should be fully instructed and guided through the procedure using verbal, text, and visual cues.
DC4
In the practice phase, users should be allowed ‘hands-on’ interaction with the objects. Guidance on how to perform the procedure should be removed to test their memory but they should be given feedback when they make a mistake.
DC5
In the assessment phase, the user’s performance should be recorded on the same task.
4 Methodology
We conducted a user study to compare immediate and longer-term training outcomes of active learning versus observing a demonstrator in VR (RQ1), and how the learning approach affects transfer of the acquired skill (RQ2). We also explore whether there are any effects of demonstrator avatar similarity on observational learning of a fine psychomotor skill (RQ3). In a between-subject design, participants learned how to assemble a “Burr puzzle” – a 3D interlocking puzzle – using either an active or observational learning approach. For the observational approach, we also explore the effects of demonstrators having either a dissimilar, matched-feature, or self-similar appearance to the user. The allocation of participants to a condition was carefully managed to ensure similar prior experience with VR, mental rotation abilities, and baseline movement imagery abilities across the groups. The study received ethical approval from a Research Ethics Committee.
4.1 Burr Puzzle
Participants were tasked with learning to assemble a 6-piece interlocking 3D puzzle, known as a Burr puzzle. We chose this task because it has been used in prior research into VR training of manual tasks and represents a fine psychomotor skill with procedural elements [14, 82]. The puzzle was designed using BurrTools 0.6.0, and the precise configuration was selected on the basis of containing 6 unique pieces and having only one solution, which could be assembled in a 5-step procedure (see Figure 3). This configuration was chosen so that the puzzle was challenging to learn to assemble [17]. We only focus on training one configuration because recalling multiple would increase the difficulty and could overload participants [82]. The Burr puzzle pieces and assembly were modelled for use in VR, and the pieces were 3D printed for the real-world transfer tasks. The virtual pieces snap together when they are held in the correct position, as is common in VR assembly tasks [19, 82, 125].
4.2 Avatar Creation
Avatars were constructed using Reallusion Character Creator 4. Self-similar avatar clothing, body shape, hair colour, and style were customised to resemble each participant (see Figure 2). The Headshot plugin was used to generate a face for the avatar based on a photograph of the participant. Avatars used for the dissimilar and matched-feature conditions were given a generic uncustomised male or female body shape, and the Headshot plugin was used to generate a face for the avatar based on an AI-generated photograph created using Generated Photos (see Figure 2). Matched-feature avatars had the hair colour, skin tone, gender, and age (young adult, old adult) that the participant identified with the most. Dissimilar avatars had a contrasting skin tone and hair colour, and the gender and age the participant identified with the least (see Figure 2).
4.3 Virtual and Real Environments
All virtual environments were created using Godot 3.5 and consisted of a room containing a table and a chair, at which participants were seated throughout. For all conditions, participants embodied their self-similar avatar. Inverse kinematics was used to control the movements of the avatar’s arms based on the controller positions, and a grip animation was played when the trigger button on the controller was held. The VR training was composed of separate Training, Practice, and Assessment phases (DC2), which are described below.
4.3.1 Familiarisation Scene.
A virtual mirror, some 3D objects, and text instructions were added to the environment to familiarise participants with the controllers and their self-similar avatar (see Figure 4a). The experimenter verbally instructed participants to complete the familiarisation procedure, which taught them how to interact with virtual objects, including how to pick up and put down objects, what red and green highlighting indicates, how to assemble two pieces, what happens if they drop an object, and how to pass objects between their hands. Participants could see themselves represented as their self-similar avatar in a virtual mirror throughout the familiarisation phase to encourage feelings of embodiment and avatar identification [48, 63].
4.3.2 Active Training Scene.
Participants in the active learning condition were given text and audio instructions describing what they would experience in the active training scene (see Figure 4c). During the scene, they were given a first-person perspective and could interact with the objects directly (DC1). Each step of the assembly was guided by a text instruction and an animation prompt displaying how to connect the next piece (DC3) [119, 126]. These would progress automatically once the participant had completed the current step being shown. The training scene ended once participants completed all 5 steps, or after a 5-minute timer elapsed.
4.3.3 Observational Training Scene.
Participants in the observational learning conditions were given text and audio instructions describing what they would experience in the observational training scene (see Figure 4b). In this scene, participants were visually shown how to complete each step of the task by observing a third-person perspective demonstration (DC3). A professional motion capture studio was used to record an expert assembling the physical puzzle, and the recording was converted into an animation performed by the demonstrator avatar and virtual pieces. The assembly lasted 30 seconds and was played twice in each training phase. Mimicking pairwise training, whereby people work together to observe and then complete the puzzle [112], and to provide a more effective viewing angle [43], the participant sat next to the demonstrator avatar and observed how to complete the assembly task (see Figure 4b).
4.3.4 Practice and Assessment Scenes.
We created a virtual environment in which participants could assemble the puzzle without guidance, allowing them to practise and test their learning (DC4). 3D-modelled puzzle pieces were positioned on the table for participants to interact with and assemble. If pieces were dropped on the floor, they reappeared in their position on the table. Some feedback was provided to participants in the form of object highlighting, indicating whether a piece could (green) or could not (red) be snapped together (DC4; see Figure 4d). The assessment scene was used for both the baseline and retention tests; however, object highlighting was disabled so that participants did not have help with the assembly task (see Figure 4e). In this scene, the participant’s performance was recorded (DC5).
4.3.5 Real World Transfer Assessment.
To test transfer to the real world, two versions of the puzzle were 3D printed at the same scale. The near transfer test used pieces that had the same colour coding as the virtual pieces, and a paper template ensured that the pieces were laid out in the same order and orientation (see Figure 5a). The far transfer test was a dual task: participants completed the same puzzle using non-colour-coded 3D-printed pieces arranged in a different orientation (see Figure 5b), whilst simultaneously counting audio tones played at random intervals, to increase the overall level of difficulty.
4.4 Apparatus & Set-up
4.4.1 Hardware.
We used a Valve Index VR System powered by a PC with an Intel Core i7-9900K processor, an RTX 2080 Ti GPU, and 32GB of RAM, running Windows 10 for the VR elements of the study. We used Valve Index controllers to allow robust and comfortable interaction. A GoPro Hero 11 was used to record the real-world assembly tasks for analysis purposes.
4.5 Outcome Variables
The primary outcome measures in this study are performance in the retention (in VR) and transfer (near and far) tests, indicated by the number of pieces assembled correctly and the time (seconds) to complete the puzzle assembly. A participant succeeds if they assemble all 6 pieces within 180 seconds. For the far transfer test, the number of tones identified was also measured.
Secondary outcome measures included movement imagery, to further indicate encoding of the procedure and skill in long-term memory [25, 54]. Baseline imagery ability was assessed using the revised Vividness of Movement Imagery Questionnaire (VMIQ-2) [105], which asks people to rate on a 5-point Likert scale how well they can imagine performing each action (1 = ‘perfectly clear and vivid as normal vision’, 5 = ‘no image at all, you only know that you are thinking of the skill’) from their own perspective, to measure internal visual imagery (IVI); from someone else’s perspective, to measure external visual imagery (EVI); and as the feeling of doing the actions, to measure kinaesthetic visual imagery (KVI). To measure imagery of the puzzle assembly, we replaced the generic items (e.g., ‘Bending to pick up a coin’) with task-specific items (e.g., ‘Manipulating and orienting the final piece into position’) [74]. Scores for each subscale are calculated by summing the item ratings and dividing by the number of items, with lower scores indicating more vivid imagery.
Perceived competence and intrinsic motivation were measured using the Perceived Competence (PC) and Interest/Enjoyment (I/E) subscales of the Intrinsic Motivation Inventory (IMI) [75]. Self-efficacy was assessed with a task-specific questionnaire developed according to Bandura’s guidelines [9], which measures the strength of an individual’s confidence (0 – 100) in their ability to execute increasingly difficult activities (e.g., assembling 2/6 up to 6/6 pieces). Self-efficacy is calculated by summing all certainty scores and dividing by the number of performance standards (five).
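This scoring can be sketched in R as follows (a minimal illustration with toy data; the item and column names are ours, not those of the study’s materials):

# Toy responses for two participants; item names (ivi_*, se_*) are illustrative.
df <- data.frame(ivi_1 = c(1, 3), ivi_2 = c(2, 4), ivi_3 = c(1, 5),
                 se_1 = c(90, 40), se_2 = c(80, 30), se_3 = c(70, 30),
                 se_4 = c(60, 20), se_5 = c(50, 10))

# VMIQ-2 subscale: sum the item ratings and divide by the number of items
# (lower scores indicate more vivid imagery).
score_subscale <- function(data, items) rowSums(data[, items]) / length(items)
df$IVI <- score_subscale(df, paste0("ivi_", 1:3))

# Self-efficacy: sum the five certainty ratings (0-100) and divide by the
# number of performance standards (five).
df$self_efficacy <- rowSums(df[, paste0("se_", 1:5)]) / 5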
Avatar identification was measured using 7-point Likert scales (1 = Strongly Disagree, 7 = Strongly Agree) taken from the polythetic model of player-avatar identification (PAI) [22] to assess physical similarity (5 items; e.g., ‘I physically resemble the avatar’), wishful identification (3 items; e.g., ‘sometimes I wish I could be more like this avatar’), and liking (4 items; e.g., ‘I like this avatar’).
Other measures include potential covariates such as prior VR experience, preferred learning style, mental rotation ability, and mental effort. Prior VR experience was measured using a single-item rating scale ranging from 0 (‘Never used VR before’) to 4 (‘I use VR often and have developed my own environments in VR’). The Learning Style Scale (LSI) [106] was used to assess individuals’ preferences towards concreteness versus abstractness (ACCE; 7 items; e.g., ‘I like to be specific’ – ‘I like to remain flexible’) and reflection versus action (AERO; 7 items; e.g., ‘I value patience’ – ‘I value getting things done’) on 6-point bipolar scales. High scores emphasise preferences toward abstract conceptualisation and active experimentation. The Revised Purdue Spatial Visualization Tests: Visualization of Rotations (Revised PSVT:R) [130] was used to assess mental rotation ability; it contains 30 questions in which an individual is asked to mentally rotate 3D objects. Participants select an answer from five options, and their score is the number of correct answers. Mental effort was measured using the Simulation Task Load Index (SIM-TLX) [41], a measure developed for the workload demands placed on users in simulated environments such as VR. Participants rate 9 dimensions on 21-point Likert scales: mental demands, physical demands, temporal demands, frustration, task complexity, situational stress, distraction, perceptual strain, and task control. An additional 5-point Likert scale was used to indicate the usefulness of the training environment. Finally, as presence has been shown to interact with learning outcomes in virtual environments [91, 110], the Multimodal Presence Scale (MPS) [71] was used to measure feelings of physical (5 items; e.g., ‘While I was in the virtual environment, I had a sense of being there’), social (5 items; e.g., ‘I felt like I was in the presence of another person in the virtual environment’), and self (5 items; e.g., ‘I felt like my virtual embodiment was an extension of my real body within the virtual environment’) presence, scored on a 5-point Likert scale.
4.6 Procedure
At the point of recruitment, participants completed an online screening questionnaire. Anyone failing to meet the inclusion criteria was automatically told they were ineligible to take part; otherwise, individuals were directed to sign up. This study was conducted over three sessions.
4.6.1 Session One.
Participants completed demographic, prior VR experience, learning style, imagery and mental rotation ability questionnaires and had their photographs taken for the self-similar avatar creation. The experimenter used the questionnaire responses to allocate participants to a condition (active, dissimilar, minimal, or self-avatar) and created the avatars.
4.6.2 Session Two.
Participants completed the familiarisation task to get used to the VR environment and controls; they were allowed to ask the experimenter for assistance if they needed it. After completing this, they were instructed to remove the headset and complete the avatar identification measures. They were then introduced to the Burr puzzle target shape and given a maximum of 180 seconds to assemble the puzzle (Baseline test).
The trials then commenced, which involved two parts: training and practice. In the training phase, participants were either guided through how to complete the puzzle with text prompts and animations (active learning) or watched the demonstrator avatar complete the Burr puzzle (observational learning). In the practice phase, all participants were given a maximum of 180 seconds to assemble the puzzle, and we recorded the number of pieces assembled correctly and the time taken. We operationalised the trials in this way because practice is essential for the associative and autonomous phases of learning a psychomotor skill [29]. We included practice in the observational conditions because observational learning without practice is inferior, while observational learning with practice has been shown to be comparable to active learning [112].
The training was repeated for up to 40 minutes, to a maximum of 10 trials. Afterwards, participants completed the questionnaire measures and the immediate retention, near, and far transfer tests. We then conducted an interview with participants and asked the following questions to gain qualitative feedback: ‘Could you please summarise how you found the virtual training experience?’, ‘Could you describe your approach/strategy when trying to learn to assemble the puzzle?’, ‘Could you please tell me how you found observing the demonstrator avatar? [Observation conditions only]’, ‘How did you feel about your own avatar, the avatar that you embodied in the virtual environment?’, ‘How about the transfer of skills from VR to real world – did you find that the training helped?’, ‘Could you envisage using this type of virtual reality training again in the future?’, ‘Is there anything else you would like to comment on or discuss relating to the VR training experience or the instructor avatar?’. Interviews were recorded and later transcribed.
4.6.3 Session Three.
Participants returned after a 10-14 day delay to complete imagery and self-efficacy questionnaires, and the delayed retention, near, and far transfer tests. Afterwards, participants were debriefed and reimbursed £15 for their time.
4.7 Hypotheses
We expect virtual training will improve puzzle assembly skills; however, prior work suggests there will be significant skill decay after 10-14 days [82]. Observational learning theories indicate that model similarity can enhance learning [7, 20, 23, 67], and prior research has shown that observing avatars which are either minimally similar to users [27] or photo-realistic self-avatars [33] can provide a feedforward effect which improves learning and therefore task performance. We hypothesise that:
H1:
Performance in VR Retention (H1a), Near Transfer (H1b), and Far Transfer (H1c) will be worse after a 10-14 day delay compared to immediate testing.
H2:
Performance in VR Retention (H2a), Near Transfer (H2b), and Far Transfer (H2c) following observational learning with self avatars will be better than dissimilar avatars (RQ3).
H3:
Performance in VR Retention (H3a), Near Transfer (H3b), and Far Transfer (H3c) following observational learning with minimal avatars will be better than dissimilar avatars (RQ3).
We do not have hypotheses for the comparison between learning techniques (RQ1 & RQ2) because using observational learning for fine psychomotor skills is underexplored in VR training and, to our knowledge, we are the first to directly compare active instruction to observing an avatar demonstrator in VR.
4.8 Participants
A sample of 102 participants (55M, 47F), aged 17-63 (M = 31.35, SD = 11.06), recruited through mailing lists, social media, and posters, completed sessions one and two. Of these, 99 returned to complete the third session; however, 6 returned outside of the 10-14 day window due to illness or holidays. All participants were screened prior to taking part to ensure they were aged 16 or over, had normal or corrected-to-normal hearing and vision, displayed no sign of colour blindness, did not have any movement-related conditions, and did not have extensive experience in completing Burr puzzles. The Ishihara test for colour deficiency [30, 49] was used, in which participants must identify the number or the presence of lines in 38 pseudoisochromatic plates; anyone deemed to have a colour vision deficiency was screened out. Participants were asked to rate their familiarity with Burr puzzles (0 = ‘Never heard of it’ – 3 = ‘I have solved many Burr puzzles’), and anyone scoring 3 was also screened out and excluded from the study.
5 Results
To assess the effectiveness of the different learning approaches, we analysed the success rate and the number of pieces assembled. Tests of normality revealed that the data for both success rate and number of pieces assembled were non-normal; therefore, where appropriate, we report the median and interquartile range as descriptive statistics. To compare active learning to observational learning (RQ1 and RQ2), we analyse success using binomial generalised linear mixed-effects models and the number of pieces assembled using repeated-measures proportional ordinal logistic regression across both immediate and delayed tasks. Assumptions for the binomial generalised linear mixed-effects models were validated using simulation-based dispersion tests from the DHARMa R package and visual inspection of Q-Q plots. Assumptions for the proportional ordinal logistic regression were validated using the test of proportional odds. To explore immediate and delayed differences between conditions, we conduct Wilcoxon rank-sum tests. For RQ1 and RQ2 we compare active against each of the observational conditions, and for RQ3 we perform all pairwise comparisons between the observational conditions. All post hoc tests are corrected using the Holm-Bonferroni method to account for multiple comparisons. All data, R scripts, and detailed results are available in Supplementary material. Descriptive statistics for the questionnaire measures, the number of people who succeeded, and the number of pieces assembled are available in Table 1, Table 2, and Table 3, respectively.
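As a minimal sketch of this pipeline in R (simulated placeholder data; the column names and model formulas are our illustrative assumptions, not the released analysis scripts):

library(lme4)     # binomial generalised linear mixed-effects models
library(ordinal)  # cumulative link (proportional odds) mixed models
library(DHARMa)   # simulation-based residual and dispersion checks

# Simulated placeholder data: one row per participant x test time.
set.seed(1)
cond   <- sample(c("active", "dissimilar", "minimal", "self"), 102, replace = TRUE)
trials <- data.frame(pid       = factor(rep(1:102, each = 2)),
                     condition = factor(rep(cond, each = 2)),
                     time      = factor(rep(c("immediate", "delayed"), 102)),
                     pieces    = sample(0:6, 204, replace = TRUE),
                     seconds   = runif(204, 30, 180))
trials$success  <- trials$pieces == 6 & trials$seconds <= 180  # success criterion (Section 4.5)
trials$pieces_f <- ordered(trials$pieces)

# Success: binomial GLMM with a random intercept per participant.
m_success <- glmer(success ~ condition * time + (1 | pid),
                   family = binomial, data = trials)
testDispersion(simulateResiduals(m_success))  # DHARMa dispersion check

# Pieces assembled: repeated-measures proportional ordinal (cumulative link) model.
m_pieces <- clmm(pieces_f ~ condition * time + (1 | pid), data = trials)

# Active vs each observational condition at one time point, with Wilcoxon
# rank-sum tests and Holm correction.
p_raw <- sapply(c("dissimilar", "minimal", "self"), function(g)
  wilcox.test(pieces ~ condition,
              data = droplevels(subset(trials, condition %in% c("active", g) &
                                               time == "immediate")))$p.value)
p.adjust(p_raw, method = "holm")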
5.1 Manipulation Checks
One-way ANOVAs were conducted to assess the balancing of the groups in terms of their existing abilities and preferred learning styles. There were no significant differences between the groups for the number of pieces assembled in the baseline test, general imagery abilities (EVI, IVI, and KVI), or ACCE, AERO, and PSVT:R scores (\(F(3,98) \le 1.546\), \(p \ge .207\), \(\eta_{p}^{2} \le 0.045\)). PSVT:R scores were positively correlated with the number of pieces assembled (r(102) ≥ .261, p ≤ .010**) and negatively correlated with time (r(102) ≥ −.265, p ≤ .008**) in the immediate and delayed retention, near, and far transfer tests. General imagery abilities were not correlated with performance (r(102) ≥ .004, p ≥ .102), nor were ACCE and AERO scores (r(102) ≥ .003, p ≥ .123).
Prior VR experience was on average none to minimal; however, there was a significant difference between the groups (H(3) = 11.181, p = .011, \(\eta_{p}^{2} = 0.118\)). The minimal group had significantly less exposure to VR than the dissimilar (Mdiff = 0.509, pHolm = .023) and self groups (Mdiff = 0.469, pHolm = .038) before taking part. There was no significant difference between the active and any of the observational groups (pHolm ≥ .098). Visual inspection of scatterplots revealed no apparent relationship between prior VR experience and performance on the retention and transfer tests. There was no significant correlation between prior VR experience and the number of pieces assembled (r(102) ≥ .016, p ≥ .291) or time to complete (r(102) ≥ .013, p ≥ .201) in any of the tests.
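These checks follow standard R idioms; a minimal sketch (simulated placeholder data; column names are illustrative, not from the study scripts):

set.seed(2)
ppts <- data.frame(group = factor(rep(c("active", "dissimilar", "minimal", "self"),
                                      length.out = 102)),
                   baseline_pieces = sample(0:6, 102, replace = TRUE),
                   psvtr           = sample(0:30, 102, replace = TRUE),
                   vr_exp          = sample(0:4, 102, replace = TRUE))

summary(aov(baseline_pieces ~ group, data = ppts))  # one-way ANOVA balance check
kruskal.test(vr_exp ~ group, data = ppts)           # non-parametric check (cf. the H statistic above)
cor.test(ppts$psvtr, ppts$baseline_pieces)          # ability-performance correlation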
Our manipulation of demonstrator avatar similarity worked as intended: self-avatars were perceived as the most similar (M = 5.760, SD = 0.881), followed by minimal (M = 4.638, SD = 1.382) and then dissimilar avatars (M = 3.840, SD = 1.890; \(F(2, 73) = 11.161\), \(p < .001\), \(\eta_{p}^{2} = 0.234\)). Post-hoc tests revealed physical similarity was significantly higher for self-avatars compared to dissimilar (t(49) = 4.702, pHolm < .001***, CI = [0.943, 2.897]) and minimal avatars (t(50) = 2.773, pHolm = .014*, CI = [0.154, 2.089]). However, no significant difference was found between dissimilar and minimal avatars (t(50) = −1.974, pHolm = .052, CI = [−1.798, 0.169]). There were no significant effects of similarity on demonstrator avatar liking (\(F(2, 73) = 2.611\), \(p = .080\), \(\eta_{p}^{2} = 0.067\)) or wishful identification (\(F(2, 73) = 0.339\), \(p = .714\), \(\eta_{p}^{2} = 0.009\)).
5.2 Skill Decay
A binomial logistic regression revealed the odds of success were significantly lower in the delayed test compared to the immediate test for the active condition (OR = 0.042, CIOR = [0.003, 0.545], z(192) = −2.426, p = 0.015*, d = −1.744, CId = [−3.153, −0.344]). There were no significant interaction effects between immediate and delayed testing and the other conditions, indicating similar levels of decay between active and observational conditions (OR ≤ 1.071, z(192) ≤ |0.849|, p ≥ .396, pHolm = 1.000, d ≤ |0.611|). Similarly, when comparing the success rate across the observation conditions there were no significant interactions, indicating similar levels of decay between the observation conditions (OR ≤ 2.197, z(142) ≤ |0.892|, p ≥ 0.372, pHolm = 1.000, d ≤ |0.583|). Overall, VR retention decays significantly over time, so we accept H1a.
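For reference, the Cohen’s d values reported alongside the odds ratios correspond to the standard logistic conversion of a log odds ratio, \(d = \ln(\mathit{OR})\sqrt{3}/\pi\) (our reconstruction, which reproduces the reported values): for the active condition above, \(\ln(0.042)\sqrt{3}/\pi = (-3.170 \times 1.732)/3.142 \approx -1.75\), matching the reported d = −1.744 once the unrounded odds ratio is used.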
A binomial logistic regression also revealed a significant main effect of delayed testing on near transfer success rate in the active condition (OR = 0.029, CIOR = [0.004, 0.237], z(192) = −3.310, p = 0.001**, d = −1.944, CId = [−3.095, −0.793]), indicating high levels of skill decay in the real world test 10-14 days after training. There were no interaction effects found when comparing active and observation conditions (OR ≤ 14.850, z(192) ≤ 2.177, p ≥ 0.029, pHolm ≥ 0.088, d ≤ |1.487|). Similarly, there were no interactions when investigating the odds of success in immediate and delayed testing between the observation conditions (OR ≤ 6.976, z(142) ≤ |1.560|, p ≥ 0.119, pHolm ≥ 0.356, d ≤ |1.108|). Overall, the odds of success in the near transfer test decay significantly, therefore we accept H1b, and there are no significant differences observed between the conditions.
A similar binomial logistic regression indicated that there was no significant change in the odds of succeeding in the far transfer test with delayed testing in the active condition, and there were no significant interactions between the observation conditions and delayed testing relative to the skill decay experienced in the active condition (OR ≤ 1.480, z(192) ≤ |0.313|, p ≥ 0.347, pHolm = 1.000, d ≤ |0.524|). Further analyses revealed no significant interactions between the observation conditions and the time of testing (immediate/delayed) relative to each other (OR ≤ 1.364, z(142) ≤ |0.285|, p ≥ 0.776, pHolm = 1.000, d ≤ |0.175|). Overall, there were no significant differences between the likelihood of succeeding in the immediate far transfer test compared with the delayed far transfer test. This finding was consistent in all conditions, therefore we reject H1c.
5.3 Active vs. Observation
5.3.1 VR Retention.
To test overall differences in the odds of succeeding in the VR retention tests in the active versus observational conditions, a binomial logistic regression was run. There were no significant differences in overall retention success rates between active and observation conditions (OR ≤ 1.312, z(196) ≤ 0.651, p ≥ 0.515, pHolm = 1.000, d ≤ |0.150|). Similarly, an ordinal logistic regression on the number of pieces assembled in the VR retention tests revealed no significant differences between the active and observation conditions, indicating similar performance in the VR puzzle assembly task whether active or observational learning was used (OR ≤ 1.208, z ≤ |0.614|, p ≥ 0.539, pHolm = 1.000, d ≤ |0.135|). To further investigate possible differences in performance between the active and observational conditions, pairwise Wilcoxon tests were conducted. These revealed no significant differences in the number of pieces assembled in the immediate and delayed retention tests, nor in the completion time between the conditions on the immediate and delayed retention tests (z ≤ |445.000|, p ≥ .051, pHolm ≥ .154).
5.3.2 Real World Near Transfer.
Binomial logistic regression for the near transfer tests revealed no significant differences in the odds of succeeding between the active and observational conditions, indicating similar levels of overall success in the real world near transfer task (OR ≤ 2.518, z(196) ≤ 1.758, p ≥ 0.079, pHolm ≥ 0.236, d ≤ |0.509|). An ordinal logistic regression was used to test for any differences in the number of pieces assembled in the near transfer tests, revealing no significant difference in the number of pieces assembled using observational compared with active learning conditions (OR ≤ 2.453, z ≤ |2.393|, p ≥ 0.017, pHolm ≥ 0.05, d ≤ |0.495|). To further confirm this, Wilcoxon tests revealed no significant differences in the number of pieces assembled or completion time following active and observational learning, both in the immediate and delayed near transfer tests (z ≤ |458.000|, p ≥ .024, pHolm ≥ .072).
5.3.3 Real World Far Transfer.
A binomial logistic regression revealed significant effects of condition on the likelihood of succeeding in the immediate and delayed far transfer tests. All observational learning conditions were more likely to succeed in completing the puzzle in the far transfer task than the active learning condition. The dissimilar observation group were significantly more likely to succeed in the far transfer tests compared to the active group (OR = 7.976, CIOR = [1.492, 42.643], z(196) = 2.428, p = 0.015, pHolm = 0.046*, d = 1.145, CId = [0.221, 2.069]). The minimal observation group were significantly more likely to succeed in the far transfer tests than the active group (OR = 6.684, CIOR = [1.320, 33.848], z(196) = 2.295, p = 0.022, pHolm = 0.046*, d = 1.047, CId = [0.153, 1.942]). The self-observation group were significantly more likely to succeed in the far transfer tests compared to the active group (OR = 6.565, CIOR = [1.280, 33.665], z(196) = 2.256, p = 0.024, pHolm = 0.046*, d = 1.037, CId = [0.136, 1.939]).
An ordinal logistic regression also revealed significant differences in the number of pieces assembled in the far transfer tests between active and all observational conditions. There were significant positive effects of all observational learning conditions compared to active learning (Active v Dissimilar: OR = 3.486, CIOR = [0.133, 0.620], z = −3.173, p = 0.002, pHolm = 0.004*, d = 0.688, CId = [−1.114, −0.263]; Active v Minimal: OR = 3.234, CIOR = [0.136, 0.701], z = −2.813, p = 0.005, pHolm = 0.010*, d = 0.647, CId = [−1.098, −0.196]; Active v Self: OR = 2.355, CIOR = [0.184, 0.981], z = −2.004, p = 0.045, pHolm = 0.045*, d = 0.472, CId = [−0.934, −0.010]).
Wilcoxon tests were used to separately analyse performance in the immediate and delayed far transfer tests. Both minimal and dissimilar observational conditions significantly outperformed the active condition for number of pieces assembled in the immediate far transfer task (Active vs Dissimilar: Z = 490.000, CI = [0.30, 1.44], p = 0.004, pHolm = 0.011*, d = 0.87; Active vs Minimal: Z = 482.500, CI = [0.20, 1.33], p = 0.006, pHolm = 0.013*, d = 0.76). The dissimilar condition significantly outperformed active in the delayed far transfer task (Z = 384.000, CI = [0.20, 1.40], p = 0.015, pHolm = 0.045*, d = 0.81). There were no other significant differences in far transfer (z ≤ |444.500|, p ≥ 0.044, pHolm ≥ 0.088).
5.3.4 Imagery.
We used two-way mixed ANCOVAs to test the effect of delayed testing relative to immediate across training conditions on task-specific internal (IVI), external (EVI), and kinaesthetic visual imagery (KVI), whilst controlling for the respective general visual imagery ability as a covariate.
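A two-way mixed ANCOVA of this form can be specified in base R as follows (a minimal sketch with simulated placeholder data; column names are illustrative, not from the study scripts):

set.seed(3)
img <- data.frame(pid         = factor(rep(1:99, each = 2)),
                  condition   = factor(rep(sample(c("active", "dissimilar",
                                                    "minimal", "self"),
                                                  99, replace = TRUE), each = 2)),
                  time        = factor(rep(c("immediate", "delayed"), 99)),
                  general_evi = rep(runif(99, 1, 5), each = 2),  # covariate
                  task_evi    = runif(198, 1, 5))                # outcome

# Covariate entered first, then the between (condition) x within (time)
# design with a within-subject error stratum.
summary(aov(task_evi ~ general_evi + condition * time + Error(pid / time),
            data = img))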
For IVI, a significant main effect of immediate/delayed testing was found (\(F(1, 95) = 15.601\), \(p < .001\)***, \(\eta_{p}^{2} = 0.214\)), revealing less vivid imagery at session three (M = 3.044, SD = 1.106) compared to session two (M = 2.033, SD = 1.006). The main effect of condition was not significant and there were no significant interaction effects. Pairwise comparisons also revealed no significant differences for active compared to observational conditions.
For EVI, a significant main effect of time (immediate/delayed testing) was found (\(F(1, 95) = 15.489\), \(p < .001\)***, \(\eta_{p}^{2} = 0.141\)), with all groups showing less vivid imagery after a 10-14 day delay (M = 3.248, SD = 1.098) compared to immediately following the training (M = 2.551, SD = 1.185). A significant interaction was found between condition and time (\(F(3, 95) = 2.813\), \(p = 0.044\)*, \(\eta_{p}^{2} = 0.082\)). There was no main effect of condition. Pairwise contrasts were used to specifically investigate any possible differential effects of active and observational learning on task-specific EVI. A significant difference was found between the active and dissimilar conditions, indicating dissimilar observational learning results in more vivid external visual imagery (Mdiff = −0.568, SE = 0.207; t(94) = −2.741, CI = [−0.980, −0.157], p = .007, pHolm = .021*). There were no significant differences found for the minimal or self observation conditions.
For KVI, a significant main effect of time was found on task-specific KVI, with significantly more vivid imagery upon immediate testing (M = 2.41, SD = 1.017) compared to after a delay (M = 3.089, SD = 1.111; \(F(1, 95) = 8.451\), \(p = .005\)*, \(\eta_{p}^{2} = 0.082\)). No main effect of condition or interaction effects were found. Pairwise comparisons also found no significant differences between the active and observational conditions.
5.3.5 Other Measures.
A two-way mixed ANOVA was conducted to investigate the effect of time and training condition on self-efficacy. This revealed a significant main effect of time, whereby self-efficacy was higher immediately following the training (M = 68.255, SD = 27.341) but decreased over the 10-14 days (M = 44.990, SD = 22.556; \(F(1,95) = 62.208\), \(p < .001\)***, \(\eta_{p}^{2} = 0.396\)). No other main or interaction effects were significant. A one-way ANOVA revealed that there was no significant effect of condition on perceived competence at assembling the puzzle following the training (\(F(3, 98) = 1.537\), \(p = .210\), \(\eta_{p}^{2} = 0.045\)). A separate one-way ANOVA indicated that there was a significant main effect of condition on interest/enjoyment (\(F(3, 98) = 2.819\), \(p = .043\)*, \(\eta_{p}^{2} = 0.079\)); however, Holm-corrected post hoc tests were unable to detect any significant pairwise differences (pHolm ≥ .055). A series of one-way ANOVAs revealed no significant main effect of condition on any dimension of the SIM-TLX, nor were there any significant differences in the reported usefulness of the different training conditions. Additionally, one-way ANOVAs revealed no significant main effect of condition on presence.
5.4 Effects of Demonstrator Similarity
To test whether demonstrator similarity affects observational learning, a series of tests were run comparing only the dissimilar, minimal, and self observation conditions.
5.4.1 VR Retention.
A binomial logistic regression revealed that the odds of success in the VR retention test did not differ significantly with demonstrator avatar similarity (OR ≤ 1.009, z(145) ≤ |1.034|, p ≥ 0.301, pHolm ≥ 0.904, d ≤ 0.237). An ordinal logistic regression also found no significant difference in the number of pieces assembled in the VR retention tests when demonstrator avatar similarity was changed (OR ≤ 0.895, z ≤ |1.181|, p ≥ 0.238, pHolm ≥ 0.713, d ≤ 0.239). Wilcoxon tests further confirmed that demonstrator avatar similarity did not significantly affect the number of pieces assembled in the immediate and delayed VR retention tests, nor were there any significant differences in completion time for the immediate or delayed VR retention tests (z ≤ |380.000|, p ≥ .247, pHolm ≥ .742). Therefore, we reject H2a and H3a.
5.4.2 Real World Near Transfer.
A binomial logistic regression also confirmed that demonstrator avatar similarity did not significantly affect the odds of succeeding or failing in the real world near transfer test (OR ≤ 1.115, z(145) ≤ |1.218|, p ≥ 0.223, pHolm ≥ 0.670, d ≤ 0.423). An ordinal logistic regression indicated that there was no significant effect of demonstrator avatar similarity on the number of pieces assembled in the real world near transfer test (OR ≤ 0.880, z ≤ 1.121, p ≥ 0.263, pHolm ≥ 0.788, d ≤ 0.252). Wilcoxon tests further confirmed that the similarity of the demonstrator for observational learning had no significant effect on the number of pieces assembled in the immediate and delayed near transfer tests, nor were there any significant differences in completion time for immediate and delayed testing (z ≤ |421.000|, p ≥ .082, pHolm ≥ .246). Therefore, we reject H2b and H3b.
5.4.3 Real World Far Transfer.
A binomial logistic regression indicated that the overall likelihood of success or failure in the far transfer test following observational learning was not significantly affected by the level of demonstrator avatar similarity (OR ≤ 0.989, z(145) ≤ |0.275|, p ≥ 0.784, pHolm = 1.000, d ≤ 0.124). There were also no significant effects of similarity on the overall number of pieces assembled in the far transfer test (OR ≤ 0.927, z ≤ 0.930, p ≥ 0.352, pHolm = 1.000, d ≤ 0.215). Wilcoxon tests confirmed that there were no significant differences in the number of pieces assembled between the observational conditions during the immediate and delayed transfer tests, and completion times did not differ significantly between the observational conditions in the immediate or delayed far transfer tests (z ≤ |421.000|, p ≥ .082, pHolm ≥ .246). Therefore, we reject H2c and H3c.
5.5 Qualitative Results
An inductive coding process as part of a reflexive thematic analysis was conducted to gain further insights into VR training. The overarching themes are discussed using participant quotes as illustrative examples; text in square brackets is used to add context to a quote to make it easier to understand.
VR Training was widely regarded as being effective for learning how to assemble the puzzle (“It was very effective, I didn’t know how to solve it in advance and now I can reasonably quickly”; “I just did that quickly, there’s no way I would have done that before at all…so yeah really really effective”), despite the complexity of the task (“The task initially seemed quite daunting, but it was a very useful way of going through it.”; “to begin with it was hard to do the task without the avatar, but as it went on, I saw the person do it in front of me it became clear”).
Users were able to execute the skill in the real world (“My performance in VR accurately reflects how I performed in real life. I basically did the same thing here and there.”), which came as a surprise to some participants (“I think it was transfer surprisingly transferable actually, I essentially follow the same process that used in in VR”). Although the far transfer task was deemed the hardest:
I think that when the pieces were put out not in the same orientation it I struggled to identify which way I was expecting to see them combined with the colour; when I did the the one without colours that was a lot harder; It was the sounds causing the stress
Some aspects of the VR training mechanics limited the transferability to the real world (“I don’t think they transferred 100%”; “There was a couple of bits and pieces that weren’t as easy to transfer”). Mainly snapping and object rigidity:
everything kind of locked into place when it got right again, something that was massively helpful that didn’t mean anything in the real world; you could sort of force them in VR to just go together, whereas you could obviously can’t do that; I think that the challenges that we had in the virtual world are different for the one that we had in the real world so somethings I could manage to transfer from the virtual world to the real world. But some things that make our life easier in the virtual world we don’t have in the real world
The pieces not ‘sticking’ in the real world posed a challenge: the pieces could slide apart (“when you put them [Assembled VR pieces] down they stayed still, which I found when doing the real task was not the case. Yeah, bits would fall out”), so more effort was required to assemble the puzzle (“having to support them or something, to wiggle them a little bit to come into place. But I think that was the only complication.”), and the pieces not locking together meant participants were unsure if they were correct (“I felt like I was relying a bit on how it auto stuck them together which I realised when I started doing this [transfer tests]”; “once I put the pieces together in the the virtual environment they remains together in the real environment um I could change them”). Having to handle the rigid pieces in the real-world task also proved difficult (“The fact that I could kind of click things through other things [in VR] certainly helped a lot which obviously doesn’t transfer outside”).
Improvements in the realism of the training were suggested to allow users to better prepare for the real-world task. Despite the controllers providing an adequate proxy (“I think the controllers work. I think if it’s simple then the controllers work. The more complicated it would get perhaps I need more developed type of control.”), having full finger dexterity would be valued for learning more complex fine psychomotor tasks in VR (“if I could move all my fingers [in VR] it will help me to do the tasks more easily”; “you could see that the instructor, like using finger by finger. I was like yeah, wish we could do that.”).
Observing was effective for learning to imitate the actions:
very effective…Yeah I’ve watched him do it, and then I could do it. I don’t know how that to explain that it just it clicks in a certain way; I would just imitate the rotation of each one; just copying it. It was. It was an effective way to do it definitely
However, there was a desire for greater control over the watching component of the training, e.g., the viewing perspective, pausing, and choosing when to observe:
I felt like if I was like at more of a almost like on top of him angle that might have been better; sometimes not seeing it from my perspective was kind of annoying; I imagine that it would be a lot more effective if you could pause; there should be something like I can pause that for some time and then I just have a look; maybe the chance to go back, to the instruction if we need; Sometimes it felt like it was too long, and sometimes it felt like it wasn’t long enough; after the 6th time or so you don’t really need to see the tutorial again
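Most of these requests amount to standard playback controls applied over a recorded demonstration. As a rough sketch, assuming the demonstration is stored as timestamped pose frames (the frame format and class below are hypothetical, not our system’s API):

```python
from bisect import bisect_right

class DemoPlayback:
    """Replays a recorded demonstration with pause, resume, and replay."""

    def __init__(self, frames):
        # frames: list of (timestamp_seconds, pose_snapshot), sorted by time
        self.frames = frames
        self.clock = 0.0
        self.paused = False

    def update(self, dt: float):
        """Advance the playback clock by dt seconds (called once per
        render frame) and return the pose snapshot to display."""
        if not self.paused:
            self.clock += dt
        i = bisect_right(self.frames, self.clock, key=lambda f: f[0]) - 1
        return self.frames[max(i, 0)][1]

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def replay(self):
        # 'go back to the instruction': restart from the first frame
        self.clock = 0.0
```

Viewpoint control falls out of the same representation: because the demonstration is replayed in the scene rather than shown as video, the learner (or the application) can reposition the camera relative to the avatar at any time.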
There were some advantages of using an avatar over a real person, such as feeling more comfortable (“it’s just an avatar so I didn’t feel judged”), having fewer distractions (“an avatar... is less distracting...so it helps to focus on the task”), and having a consistent demonstration to follow (“it was good seeing the same thing over and over again”; “Obviously what he did was exactly the same every time whereas it might be slightly different if it’s an actual person.”). Otherwise, observing the demonstrator avatar was akin to watching a real person: the movements were realistic (“It felt fairly realistic, like watching someone in in terms of the movements.”; “The movements were very smooth from the Avatar and it was very easy to follow what the Avatar was doing”). However, the lack of communication was noted:
if I have the chance to ask questions from the person, yeah, I would prefer to have a person because besides looking, I could make some questions.; so I couldn’t be like oh stop. I want to see exactly, like turn it around. I want to see exactly what you’re doing, or can you do that again? It’s just she did it
Self-avatars presented some disadvantages for learning. Most participants felt that their self-avatar resembled them (“it was pretty accurate to my real appearance, I didn’t expect it to have like the same outfit, the same like hair, face, It’s pretty cool”). Some liked having a self-avatar (“I think I related more with the whole experience just because I have an avatar similar to me”; “Nice to have an avatar that similar to me.”), but viewing a self-avatar was not always positive, evoking uncanny valley effects (“I thought it something just looked kind of strange”; “I was both kind of repulsed and amazed at the same time”) and becoming a distraction:
it was quite creepy, I probably wish it didn’t resemble me. because again... sort of… I would almost judge myself against that: It should be me doing this better than I do!; In the beginning it was distracting cause I was trying to like compare between myself and the avatar. Yeah. And then towards the end I was like I actually want to complete it; I think I was just shocked cause like ohh that’s me
For others, the novelty of having a self-avatar wore off quickly (“I didn’t really care after the first time that it looked like me, so it was pretty normal”). In comparison, the minimal and dissimilar avatars were not distracting:
I can’t remember the [minimal] instructor particularly well, I think I was mainly focused on the on the actual blocks rather than the instructor; I wasn’t really concentrating on the [dissimilar] instructor themselves to be honest I was just very focused on the on the puzzle. So yeah, it could have been anything, anyone or anything sitting there.
6 Discussion
Our findings demonstrate that VR training was successful for learning an assembly task, with the majority (77–89%) of participants able to complete the puzzle in the VR retention test after the training. This is in keeping with prior work showing the effectiveness of VR training for acquiring fine psychomotor skills [34, 82, 125], and participants across all conditions commented on the effectiveness of training virtually. Whilst existing VR training applications almost exclusively use active learning [2, 4, 11, 88, 95, 99, 125], which is generally considered more effective than observational learning [112], our findings indicate that observational learning combined with practice is a highly effective learning approach in VR. All participants assembled the puzzle in the practice phases, but those in the active learning condition received twice as much ‘hands-on’ experience because they also assembled the puzzle during the training phases, whereas participants in the observational learning conditions watched the puzzle being assembled during those phases. Despite this, we were unable to detect significant differences between the active and observational conditions (RQ1). Our analysis reveals that, with 95% confidence, any differences in overall success rate for the retention task in VR would amount to a medium effect size at most. These similarities apply not only to performance during the retention task but also to user experience, as measured by enjoyment, perceived competence, presence, and the physical and mental workload placed on users.
Similarly, we found no significant differences between active and observational learning for transferring skills to the real world when the task is exactly the same. While this does not necessarily mean they are equivalent, it does show that both are effective for acquiring fine psychomotor skills in VR, in line with related work suggesting a degree of functional equivalence between action and observation [53, 80, 94, 119]. However, we provide evidence that observational learning significantly improves users’ ability to transfer their learning to real-world tasks beyond the context learned in VR, compared with active learning (RQ2). The odds of succeeding in the far transfer tests were between 6.6 and 7.9 times higher for those in the observational learning conditions than for those in the active condition, with a large effect size for each of the individual observation conditions. For two of the observational conditions (dissimilar and minimal), this advantage over active learning was present immediately after training. The phenomenon has also been demonstrated outside of VR training, indicating that observing contributes to learning in a way that allows the individual to apply their skills more easily to variations of a task [85, 109, 112]. This has important implications for VR training because operating in an unchanging environment is extremely rare, and real-world transfer matters in the majority of tasks where VR training is already deployed or is likely to be deployed in the future [14, 69, 82, 92]. It also has important methodological implications for VR training research, because most prior VR training studies measure only near transfer as an indicator of success [2, 14, 34, 44, 82, 92]. Our results demonstrate that far transfer success is much lower than near transfer across all conditions, yet far transfer is more applicable to the industrial settings where these techniques will be deployed. Integrating and studying far transfer as part of VR training research methodology should therefore be prioritised.
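For readers less familiar with the statistic, an odds ratio and its Wald 95% confidence interval can be computed directly from a 2×2 success/failure table. A short illustration with made-up counts, chosen only so the result lands near the reported range (these are not our cell counts):

```python
import math

def odds_ratio_ci(s1, f1, s2, f2, z=1.96):
    """Odds ratio of group 1 vs group 2, with a Wald 95% confidence
    interval, from success (s) and failure (f) counts."""
    or_ = (s1 / f1) / (s2 / f2)
    se_log = math.sqrt(1 / s1 + 1 / f1 + 1 / s2 + 1 / f2)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, (lo, hi)

# Made-up counts: 18/30 observational vs 5/30 active far-transfer successes.
or_, (lo, hi) = odds_ratio_ci(18, 12, 5, 25)
print(f"OR = {or_:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")  # OR = 7.5, 95% CI [2.2, 25.1]
```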
The far transfer benefits of observational learning may be explained by the cognitive processes that occur during learning [112]. This is reflected in the finding that a dissimilar avatar leads to significantly stronger external visual imagery for the task than active learning, indicating differences in the immediate cognitive effects of the learning processes that are strongly correlated with performance [32, 54, 76]. The attentional resources available during learning are also likely to play a role. Early on, when the task is unfamiliar, observational approaches allow users to direct their attention to the requirements of the task and focus on cognitively understanding the complex nature of the puzzle [127]. In contrast, active learning is more likely to direct cognitive resources into physically completing the task, which may limit users’ understanding of the procedure as a whole. Additionally, in the observational conditions the demonstrator avatar consistently shows participants an effective series of actions for assembling the puzzle, which could lead users to adopt that strategy early in the learning process [112]. This avoids a known issue with active learning, where users are more likely to develop unhelpful techniques before arriving at an effective, refined strategy [8, 15, 127].
In line with prior work, our results show significant skill decay across all conditions after a 10–14 day delay between training and performing the tests [82]. Only 28–39% of participants were able to completely assemble the puzzle in the delayed VR retention task, representing a nearly 50% drop in success rates, and significant decay was also observed in the near-transfer tasks. The significant deterioration in participants’ ability to imagine completing the task from an external, internal, and kinaesthetic perspective after the delay demonstrates that they forget how to perform the skill. This reiterates the importance of repeating training and of not allowing long periods of time to pass between being trained virtually and utilising the skill [38, 61]. However, there was no significant decay in far transfer skills, most likely because performance in the far transfer tests was already poor immediately after training. For example, only 15% of those trained in the active condition were able to successfully assemble the same puzzle once the colour-coding, set orientation, and distraction-free environment were removed.
The observational learning conditions included a hands-on practice component, so we cannot draw conclusions about how much of the skill was gained purely from observation. Prior research suggests that observation without practice would be inferior to active training [112], as gaining ‘hands-on’ experience is an essential part of learning psychomotor skills [29]. Including practice in the observational conditions has likely increased the learning gains; however, any differences observed are likely due to the active versus observational training approach, because the practice phases were identical across conditions.
We observed high variability in performance across participants and, therefore, within groups. One potential explanation is that most participants had little to no prior VR experience, which is known to impact performance in VR [108]. However, prior VR experience was not correlated with performance, suggesting that the initial familiarisation scene sufficiently mitigated the effects of prior experience. Mental rotation ability is another factor likely to explain the high variability, with high scorers more likely to succeed in an assembly task [42]; we therefore balanced the groups on this measure to avoid it becoming a confounding variable.
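A straightforward way to balance groups on such a pre-test measure is blocked (stratified) random assignment. The sketch below illustrates the idea with hypothetical scores and condition labels; it is not necessarily our exact procedure:

```python
import random

def balanced_assignment(scores, conditions, seed=0):
    """Assign participants to conditions so that each group receives a
    similar spread of the stratifying score (e.g. mental rotation).

    Participants are ranked by score and split into blocks of
    len(conditions); within each block, the conditions are dealt out
    in a random order."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get, reverse=True)
    assignment = {}
    for i in range(0, len(ranked), len(conditions)):
        order = list(conditions)
        rng.shuffle(order)
        for pid, cond in zip(ranked[i:i + len(conditions)], order):
            assignment[pid] = cond
    return assignment

# Hypothetical mental rotation scores and four condition labels.
groups = balanced_assignment(
    {"P1": 17, "P2": 9, "P3": 14, "P4": 11, "P5": 20, "P6": 6},
    ["active", "self-avatar", "dissimilar", "minimal"])
```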
We provide the first evidence about the implications of applying feedforward learning to fine psychomotor tasks in VR, with surprising results. In contrast to prior work [27, 28, 33], we find no significant effects of avatar similarity on learning, and it is likely that only small effects exist. However, qualitative insights reveal that using fully customised, more realistic representations of the user in observational learning was more likely to produce uncanny valley and novelty effects, which can distract the user and prevent them from focusing on the learning task [81, 123]. These effects of self-avatars are important to note: few people will have been exposed to highly realistic digital models of themselves, so this phenomenon would likely also manifest in industrial VR training applications. Avoiding similarity between the user and the demonstrator avatar altogether could therefore be the most appropriate approach for observational learning of fine psychomotor skills in VR (RQ3). The contrast between our findings and related work [27, 33] is likely due to the difference between learning gross and fine psychomotor skills, where the emphasis shifts from watching the avatar as a whole performing full-body movement to focusing on just the hands performing a manual task. It could be that interest in the avatar’s appearance draws the user’s attention away from the puzzle task [93, 121, 122], inhibiting any possible feedforward learning benefit.
6.1 Limitations & Future Work
We provide a first comparison of active and observational learning techniques in VR. Whilst we provide evidence of the advantages of implementing observational methods in combination with practice in VR, producing an animated avatar to demonstrate a skill currently relies on an intensive workflow, often requiring state-of-the-art motion capture systems such as the one utilised in this study and in other related work [33]. However, markerless motion capture and animation technologies continue to improve (e.g. MoveAI) and we anticipate that observational learning will become easier to develop and more scalable to implement.
We selected the Burr puzzle task as an example of an assembly task because it requires the same skilled elements (e.g. part recognition, selection, rotation, aligning, and fixing) as real-world industrial assembly tasks (e.g. electronic actuator assembly [34], pump maintenance [125]). However, Burr puzzles are arguably more abstract than most real-world assembly tasks, so the question of how far active and observational learning in VR transfer to real-world tasks is a direction for future work. Our task was neither too easy nor too hard for participants: most were unable to perform it immediately, and almost all were able to perform it by the end. However, future work should explore VR training for tasks with varying levels of complexity.
Interactions with the puzzle pieces were achieved using controllers, an input method generally preferred over freehand interaction in VR training applications [104]. Whilst this proved to be an acceptable proxy, some participants expressed a desire for full finger dexterity to enable more fine-grained control and manipulation. Full finger dexterity would become especially important for learning more complex fine psychomotor tasks in VR (e.g. tasks involving smaller pieces and finer movements), and therefore integrating alternative haptic interaction methods (e.g. SenseGlove, Manus, HaptX) into VR training may be necessary in the future.