Abstract
In education, learning concentration is closely related to the quality of learning, and teachers can adjust their teaching methods accordingly to improve students' learning outcomes. In head-mounted virtual reality (VR) interactions in particular, current methods for assessing learning concentration cannot be fully applied because immersion shaping and cognitive formation differ from those of conventional education. Therefore, in this study, a method is proposed to measure the learning concentration of students in head-mounted VR interaction, using the expression score, visual focus rate, and task mastery as evaluation indicators. In addition, the weights of the evaluation indicators used in the calculation of learning concentration can be configured according to the characteristics of different course types. The results of a usability evaluation indicate that the learning concentration of students can be effectively evaluated using the proposed method. By developing and implementing strategies for optimizing learning effects, the learning concentration and assessment scores of students increased by 18% and 15.39%, respectively.
1 Introduction
Virtual reality (VR) technology can overcome the time and space limitations of conventional education. The immersive learning experience it provides can promote learning motivation and situated cognition and enhance the learning experience. Therefore, VR technology has been extensively applied in education in recent years to improve the quality of teaching (Kim et al. 2020; Sutjarittham et al. 2019; Tsai et al. 2020). Learning concentration is a crucial factor affecting the learning effect in conventional education, as it reflects students' degree of attention (Arana-Llanes et al. 2018; Castelló et al. 2020). Learning concentration likewise influences the learning effect in a virtual environment. If the learning concentration of students during VR educational interactions can be effectively evaluated, it will help them adjust their learning status and thereby improve the learning effect. Accordingly, in this study, the learning concentration of students in head-mounted VR interaction was determined by analyzing the characteristics of VR education.
Significant progress has been made in detecting learning concentration in conventional classrooms. Guo et al. (2018) proposed a convolutional neural network (CNN)-based analysis method suitable for conventional teaching, in which the learning concentration of students was quantified using micro-expressions as a quantitative index. Using the head-up rate and facial expression recognition (FER) results as evaluation indicators, Shi (2020) developed an analysis method to assess learning concentration in conventional education. Although these methods have achieved good results in conventional education, they are not applicable to the evaluation of learning concentration in head-mounted VR environments. Notably, students' attention is usually focused on the blackboard or instructor in a conventional classroom. Conversely, when interacting with a head-mounted display (HMD) in a VR environment, students' attention shifts as their focus of interest changes, meaning that it is no longer fixed on a single area. In addition, the eyes, eyebrows, part of the nose, and other essential expression features are obscured by the HMD, significantly reducing the accuracy of FER using conventional methods.
Because conventional methods are no longer suitable, concentration analysis methods for VR interaction have been explored. Sensor-based concentration analysis methods primarily use sensors to capture FER data (Sutjarittham et al. 2019; Tsai et al. 2020). Facial and eye data collected from electromyography (EMG) sensors were analyzed according to the electrical signals generated by facial muscle movement. However, these methods require students to wear various sensors, such as brain wave sensors, electromyographs, and skin electrical sensors, which cause discomfort. In addition, because the sensor sampling points cannot cover the entire face, FER has low precision (Shen et al. 2019).
Considering these problems, CNN-based FER methods have been developed. Teng (2017) proposed an FER process for virtual environments based on LeNet, using image data as the FER data source to avoid the discomfort caused by wearing EMG sensors. However, the FER network constructed using this method lacked sufficient training owing to the limited dataset; thus, the average FER accuracy was 69.39%, which leaves room for improvement.
To further enhance the FER rate in a VR environment, Wu (2019) developed an FER method based on face image reconstruction. The face image obscured by an HMD was reconstructed using generative adversarial networks, and visual geometry group 16 (VGG16) was used to improve feature extraction. The evaluation results indicated that the FER rates for the CK+ and restored CK+ datasets reached 98.8% and 94.8%, respectively. Although this FER accuracy is high in a virtual environment when the deflection angle is small, the method has inherent defects. First, a frontal unoccluded picture of the user must be obtained in advance as a reference; otherwise, the occluded face image cannot be accurately reconstructed, which affects the accuracy of FER. Second, owing to the high degree of freedom (DOF) of VR interaction, the image captured by the camera may cover a wider adjustment range of head posture, further reducing the accuracy of this FER method.
Among methods based on interaction data, Yeh et al. (2020) constructed an automatic assessment system for attention deficit disorders using the number of missed errors, reaction time, focus time, and total rotation angle as concentration evaluation metrics. By analyzing the interaction data, this system automatically determines whether the user has attention deficit hyperactivity disorder. However, the system primarily targets children suspected of having attention deficit hyperactivity disorder, and its task design is relatively simple, limiting its scalability.
In this study, we propose a method for evaluating learning concentration in head-mounted virtual reality interaction (VRLC). The VRLC has the following characteristics:
1. According to the high-DOF characteristics of operations in VR interactions and the diversity of virtual scenes, the expression score, visual focus rate, and task mastery were set as comprehensive indicators for evaluating learning concentration. Moreover, the graded valence emotion can be set based on the characteristics of different types of VR education systems, and the corresponding expression weights can be assigned to calculate the learning concentration scores of students. The optimization strategy was formulated by studying users' interaction behavior patterns derived from the analysis of the learning concentration score to improve the interactive experience of the VR education system. Finally, the learning effect is enhanced;
2. In a head-mounted VR environment, the recognition rate of existing FER methods is reduced owing to the larger adjustment range of head posture. Thus, by simplifying the attention mechanism (Abdullah et al. 2019; Maraza et al. 2020), we propose an FER method suitable for head-mounted VR interaction (FERVR). By fusing global and local features, the weights of the unobscured local areas are increased; thus, the influence of HMD occlusion and of a wider range of head postures on FER is reduced, and the reliability of the learning concentration score in VRLC is enhanced by improving the FER rate in a VR environment.
2 Method
As shown in Fig. 1, the VRLC process includes the calculation of learning concentration in head-mounted VR interaction and research on FER methods. The optimization strategy was formulated by studying the user’s interaction behavior patterns derived from the analysis of the learning concentration and assessment scores to improve the interactive experience of VR education systems.
In conventional education, facial expressions and the head-up rate can be used as indicators for measuring learning concentration (Guo and Zhang 2019; Shi 2020). However, the evaluation of learning concentration when wearing an HMD for VR interaction differs from that in a conventional classroom. First, students' engagement and experience with VR interaction are more valuable for achieving knowledge acquisition and skill consolidation through practice and reflection. Second, because interactive objects in a virtual environment can exist anywhere in three-dimensional (3D) space, the student's visual focus is not limited to a specific area. The evaluation indicators for measuring learning concentration in a VR environment are shown in box ① (blue line) in Fig. 1; in this study, the expression score, visual focus rate, and task mastery were proposed as comprehensive evaluation indicators for measuring learning concentration.
To provide the FER results for the expression score in the first step, we simplified the attention mechanism according to the occlusion characteristics and the larger adjustment range of head posture in a VR environment, and then proposed FERVR. Global and local features are fused by FERVR to recognize expressions. As a result, the influence of HMD occlusion is reduced and robustness to the larger adjustment range of head posture is improved. As shown in Fig. 1 ②, because few expression datasets collected in VR environments with HMDs are available, we generated a new dataset suitable for head-mounted VR interaction by adding an HMD mask to the eye position of the face images in the Radboud Faces Database (RaFD) (Langner et al. 2010). After FERVR training was completed, the facial expression data captured during virtual learning were used as input to FERVR. Finally, the FER result was applied to the expression score calculation.
The steps for analyzing the learning concentration are shown in Fig. 1 ③. After the learning concentration score has been derived in steps 1 and 2, the learning concentration of students can be analyzed together with the assessment score of the VR education system to improve the learning effect. Psychological counseling and VR education system optimization were used to develop learning effect optimization strategies. Students with a low concentration score were guided to actively participate in the VR education experience through psychological counseling. According to the analysis results of the learning concentration, shortcomings of the VR interaction design can be identified; by formulating and executing VR education system optimizations, the interactive experience and teaching quality can be improved.
2.1 Calculation of the learning concentration score
2.1.1 Evaluation indicators for measuring the learning concentration
Learning concentration is a meta-construct that includes emotional, behavioral, and cognitive focus (Fredricks and Mccolskey 2012). Emotional focus is related to the students’ emotional involvement during learning activities (Christenson and Reschly 2012). Positive emotions include enthusiasm, interest, and enjoyment while learning (Renninger and Hidi 2016), whereas negative emotional components include boredom, sadness, and frustration in the classroom (Skinner et al. 2008; Skinner 2016). Theories of motivation, including the self-determination and control-value theories of academic emotions (Deci and Ryan 1985; Pekrun and Linnenbrink-Garcia 2012), emphasize the role of both positive and negative emotions on the students’ involvement in learning activities and underscore how affective dynamics can sustain or disrupt learners’ engagement to impact learning performance (D’Mello and Graesser 2012; Pekrun and Perry 2014; Gupta et al. 2018).
Behavioral focus is the degree to which students are active in learning activities (Fredricks et al. 2004). This is reflected in the students’ ability to effectively execute cognitive strategies and put action and effort into achieving learning goals (Sinatra et al. 2015; Alemdag and Cagiltay 2018). Moreover, behavioral focus is considered key to success (Sinatra et al. 2015).
Cognitive focus is the student’s level of investment in learning (Meece et al. 1988; Parong and Mayer 2021). It includes being thoughtful, strategic, and willing to exert the necessary effort for comprehension of complex ideas or master difficult skills (Fredricks et al. 2004). Cognitive focus measures are considered to have self-regulation and motivation components (Fredricks et al. 2004; Ainley 2012; Christenson et al. 2012), and they have been found to affect various positive outcomes, including motivation and learning achievements (Guthrie et al. 2004; Chi and Wylie 2014; Greene 2015).
In summary, emotional, behavioral, and cognitive focus are the three dimensions that can effectively reflect the learning concentration of students. Therefore, as shown in Fig. 2, we used the expression score, visual focus rate, and task mastery as the indicators to quantify the impact of these dimensions and used a set of formulas to calculate the learning concentration.
2.1.1.1 Emotional focus and expression score
The psychologist Mehrabian suggested that emotional information is conveyed 7% through language, 38% through voice, and 55% through facial expressions (Li et al. 2019). Therefore, facial expressions play an essential role in emotional expression.
The basic emotions within the valence-arousal-dominance model were described by Arya et al. (2021), as shown in Fig. 3a. They categorized emotions into three dimensions: valence, arousal, and dominance. Valence describes how negative or positive a feeling is. Arousal defines the intensity of an emotion, such as how strongly a person feels it when excited. Dominance, the third dimension, represents the degree of control generated by the stimulus, that is, whether a person feels dominant or submissive toward it (Mitruţ et al. 2019; Arya et al. 2021). The most commonly used model is the circumplex model of affect spanned by the valence and arousal dimensions (Russell and Barrett 1999; Zangeneh Soroush et al. 2018), in which emotions are placed in a two-dimensional circular space. As shown in Fig. 3b, the vertical axis represents arousal and the horizontal axis represents valence. In this model, expressions are classified as low-valence negative affect and high-valence positive affect, which are assigned negative and positive values, respectively (Russell and Barrett 1999; Zangeneh Soroush et al. 2018).
Guo and Zhang (2019) combined the 3D learning state space with affective dimension theory and proposed an evaluation model of classroom attention. The model sets the weight values of expressions not related to the classroom (fear, expressionless) to 0, whereas the weight value of emotions indicating that students are very dissatisfied with the classroom content (disgust, contempt, anger, sadness, and confusion) is set to −2. In addition, the emotions of happiness and surprise, which are considered to indicate satisfaction with the class content, are given a weight value of 2. Because students show no significant emotional expression (a neutral expression) when listening attentively, Shi (2020) argued that setting the weight value of neutral expressions to 1 can improve the accuracy of learning concentration assessment.
During the design of a VR education system, an explicit and carefully thought-out educational purpose is essential (Boutefara and Mahdaoui 2020). VR can stimulate students' motivation and interest in learning, and their evaluation of the VR learning content can reflect whether the design of the VR education system is reasonable (Suhaimi et al. 2020; Tai et al. 2022). We set the expression score as the evaluation indicator of emotional focus to measure students' interest in the VR learning content. In addition, considering the diversity of VR education scenarios, we classified the graded valence emotions in head-mounted VR interaction as high-valence positive affect, medium-valence neutral affect, and low-valence negative affect.
High-valence positive affect indicates a category of expressions that are consistent with the instructional purpose of the VR education system. The presence of this type of expression suggests that students were immersed in the virtual experience and they focused on the interactive content, resulting in higher expression scores.
Conversely, the expressions that designers of VR education systems do not expect to appear are set as low-valence negative affect. Expressing this type of affect indicates that students are not interested in the current virtual experience.
Medium-valence neutral affect is a state of emotional ambiguity between high-valence positive affect and low-valence negative affect, and emotions in this state cannot provide a clear valence (Guo and Zhang 2019; Shi 2020). Medium-valence neutral affect is not what the designers of VR education systems expect from students, and it does not accurately measure the level of interest in the virtual experience. In conventional classroom education, expressionlessness usually falls into the category of medium-valence neutral affect. However, in a head-mounted VR experience, expressions other than expressionless can also be set as medium-valence neutral affect in specific situations. The configuration rules for expression weights are summarized in Table 1.
Referring to the measures of learning concentration developed by Shi (2020) and Guo and Zhang (2019), we propose a set of formulas for calculating the expression score. Each student's expression score (fk) is calculated by multiplying the proportion of each expression by its expression weight. In Eq. (1), the eight categories of expressions (happiness, sadness, disgust, surprise, fear, anger, contempt, and expressionless) are indexed by the expression number i (0 ≤ i ≤ 7), Ti (0 ≤ i ≤ 7) represents the frequency of each expression, N is the total number of expressions, and Ci (0 ≤ i ≤ 7) represents the weight corresponding to each expression category:

$$f_k=\sum_{i=0}^{7}\frac{T_i}{N}\,C_i \qquad (1)$$
To formulate the evaluation criteria for measuring the expression score, the expression scores of students are normalized using the polar linear (min-max) method (Pedram et al. 2020). As shown in Eq. (2), by comparing the maximum and minimum expression scores, each student's expression score is normalized to the range [0, 1]:

$$f_k'=\frac{f_k-f_{\min}}{f_{\max}-f_{\min}} \qquad (2)$$
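As a minimal illustration of Eqs. (1) and (2), the following Python sketch computes and normalizes expression scores; the weight values and function names are our own illustrative assumptions, and actual weights should follow the configuration rules of Table 1.

```python
# Illustrative expression weights C_i; real values must follow Table 1's
# course-specific configuration rules, not these numbers.
WEIGHTS = {"happiness": 2, "sadness": -2, "disgust": -2, "surprise": 2,
           "fear": 0, "anger": -2, "contempt": -2, "expressionless": 1}

def expression_score(counts: dict) -> float:
    """Eq. (1): f_k = sum_i (T_i / N) * C_i over the eight expression classes."""
    n = sum(counts.values())  # N: total number of recognized expressions
    return sum(counts.get(e, 0) / n * w for e, w in WEIGHTS.items())

def normalize(scores: list) -> list:
    """Eq. (2): min-max normalization of all students' scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```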
2.1.1.2 Behavioral focus and visual focus rate
In conventional classroom teaching, the presentation area of knowledge points is fixed on a blackboard or a projection screen. Consequently, students’ degree of learning concentration can be measured using indicators such as facial expressions and head-up rate. However, owing to the high DOF of VR interactive operation, the position of the student's visual focus constantly changes, making conventional metrics inapplicable for measuring learning concentration in HMD-wearing contexts.
A visual focus data channel is another objective means of discerning fluctuations in cognitive engagement during learning (D’Mello et al. 2017). By monitoring visual focus, feedback can be provided on the learner's state in response to specific stimuli. When students' vision is steadily focused on a point, it indicates that they are taking more action to accomplish their learning goals (D’Mello et al. 2017). In addition, visual focus can reveal the decision-making process (Krejtz et al. 2016). Choice behavior and gaze allocation are related: people tend to look longer at the item they will choose than at the item they will reject, generating a gaze bias effect (Thomas et al. 2019). Accordingly, the visual focus rate was set as one of the indicators for evaluating the degree of learning concentration.
In a virtual environment, explaining knowledge points through an avatar is a common form of interaction. During an explanation, the time that the student's sight stays on the avatar or the presentation area is set as the focused learning time. The visual focus rate, which is the ratio of the focused learning time to the total time spent explaining knowledge, was set as another measure of learning concentration. The process of seeking knowledge points is shown in Fig. 4; when constructing the VR interaction, a ray is emitted forward from the position of the student wearing the HMD to simulate the attention line of sight. The visual focus rate can then be measured by determining whether the ray falls within the knowledge presentation area. Figure 4a shows that the focused learning time starts counting when students observe the presentation area of knowledge points, indicating that they are in a state of focused learning. As shown in Fig. 4b, students are not in a focused learning state when their eyesight stays outside the presentation area of knowledge points; thus, the focused learning time stops counting.
The visual focus rate (Rc) is calculated as in Eq. (3):

$$R_c=\frac{T_c}{T_t} \qquad (3)$$

where Tc is the focused learning time and Tt is the total knowledge explanation time.
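The focused learning time can be accumulated frame by frame from the gaze ray test described above. The following Python sketch illustrates this logic; the class and method names are hypothetical rather than part of the system.

```python
class FocusTimer:
    """Accumulates focused learning time T_c during a knowledge explanation."""

    def __init__(self):
        self.t_c = 0.0  # focused learning time (s)
        self.t_t = 0.0  # total knowledge explanation time (s)

    def update(self, gaze_hits_presentation_area: bool, dt: float):
        # Called once per frame; dt is the frame duration. The boolean is the
        # result of testing whether the forward gaze ray intersects the
        # avatar or knowledge presentation area.
        self.t_t += dt
        if gaze_hits_presentation_area:
            self.t_c += dt

    @property
    def visual_focus_rate(self) -> float:
        """Eq. (3): R_c = T_c / T_t."""
        return self.t_c / self.t_t if self.t_t else 0.0
```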
2.1.1.3 Cognitive focus and task mastery
In a conventional classroom lecture-teaching setting, students' mastery of learning content affects the final learning outcome. It is difficult for instructors to ensure that all students master the skills being taught when the number of students is large and the course duration is limited. A common method for testing students' task mastery is through individual and group questioning. However, any approach that relies on the instructors' subjective judgment inevitably leads to deviations.
Cognitive focus, which is a complex process of cognitive processing and information handling (Liu and Wang 2017), reflects students' mastery of the learning content (Kim and Schatschneider 2017; Wong 2018). It requires students to activate prior knowledge and access memory, and then analyze, integrate, transfer, and create new knowledge, information, and problems (Liu and Wang 2017). The longer students spend on a knowledge point, the more difficult its cognitive processing is and the more cognitive effort they invest (Liu and Chuang 2011; Krejtz et al. 2016). Kruger and Doherty (2016) discovered that the time students spent on a reading task was significantly related to their degree of cognitive focus. Furthermore, the degree of cognitive focus can be predicted from the duration of a reading task (Kim and Schatschneider 2017; Wong 2018).
The aforementioned knowledge construction and cognitive focus theories are also applicable to the design of VR educational scenarios. The knowledge acquired in virtual scenarios is usually divided into multiple task prompts; guided by these prompts, students deepen their cognitive understanding by comparing and analyzing the knowledge. We recorded interaction data related to students' reading of task prompts to assess their cognitive focus on the knowledge points in the virtual scenario. First, the learning task reminders in the virtual environment appear in sequence according to a pre-planned trigger mechanism that continuously guides the student to learn stepwise. Additionally, every task prompt is presented through a combination of text, voice, and icons. Students can confirm a task reminder according to their mastery of the knowledge points or re-watch it later. The time taken to confirm the current task reminder is set as the reading completion time of the learning task, and task mastery is the ratio of this time to the total time of the learning task prompt. In summary, the expression score, visual focus rate, and task mastery are used as comprehensive evaluation indicators of learning concentration. The task mastery (Rm) is calculated as in Eq. (4):

$$R_m=\frac{T_r}{T_g} \qquad (4)$$

where Tr is the reading completion time of the learning task and Tg is the total time of the learning task prompt.
2.1.2 Learning concentration calculation
The calculation process of the learning concentration for head-mounted VR interaction is shown in Fig. 5. It mainly comprises three steps: facial expression score calculation, interaction data analysis, and learning concentration score calculation.
As shown in Eq. (5), the learning concentration score is a weighted sum of the expression score, visual focus rate, and task mastery, which are the comprehensive evaluation indicators of virtual learning concentration, with α, β, and γ as the respective weights of the three indicators:

$$S=\alpha f_k'+\beta R_c+\gamma R_m \qquad (5)$$

where S denotes the learning concentration score. Among these indicators, the expression score reflects students' concentration on the VR experience content, whereas the visual focus rate and task mastery reflect students' attention to the presentation area of knowledge points.
In this study, the analytic hierarchy process (AHP) was used to determine the weight distribution of the expression score, visual focus rate, and task mastery in calculating learning concentration (Shete et al. 2020). AHP is a decision-making method that decomposes the elements related to a decision into levels of decision goals, intermediate-level elements, and alternatives, on the basis of which qualitative and quantitative analyses are performed. In this method, experts derive a comparison hierarchy by pairwise-comparing every indicator according to the meaning of the weights; thus, it has high reliability and low error.
A total of 21 senior engineers and six interaction designers from the VR development departments of Netdragon and Huayu Education Technology were invited to participate in an expert review to evaluate the indicator weights of virtual learning concentration. In this expert scoring review, the relative importance of the expression score, visual focus rate, and task mastery dimensions was compared pairwise. The scale values were set as 1, 3, 5, 7, and 9, indicating equal, slight, significant, vital, and extreme importance, respectively. The consistency ratio of the expert ratings was 0.0516, which is less than 0.1 and therefore passes the consistency test. The final weights of the indicators calculated by the AHP were 0.4074, 0.3735, and 0.2191. Therefore, as shown in Eq. (6), substituting the calculated weight values into α, β, and γ yields the formula for learning concentration:

$$S=0.4074\,f_k'+0.3735\,R_c+0.2191\,R_m \qquad (6)$$
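The following Python sketch illustrates the standard AHP weight and consistency ratio computation together with Eq. (6); the pairwise comparison matrix shown is hypothetical, as the actual matrix aggregated from the 27 expert ratings is not reproduced here.

```python
import numpy as np

# Hypothetical pairwise comparison matrix over (expression score, visual
# focus rate, task mastery); the paper's matrix came from the expert review.
A = np.array([[1.0, 1.0, 2.0],
              [1.0, 1.0, 2.0],
              [0.5, 0.5, 1.0]])

def ahp_weights(a):
    vals, vecs = np.linalg.eig(a)
    k = int(np.argmax(vals.real))        # index of the principal eigenvalue
    w = np.abs(vecs[:, k].real)
    w = w / w.sum()                      # normalized priority weights
    n = a.shape[0]
    ci = (vals[k].real - n) / (n - 1)    # consistency index
    ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]  # Saaty's random index
    cr = ci / ri                         # consistency ratio (< 0.1 passes)
    return w, cr

weights, cr = ahp_weights(A)  # the paper's final weights: 0.4074, 0.3735, 0.2191

def learning_concentration(f, r_c, r_m, w=(0.4074, 0.3735, 0.2191)):
    """Eq. (6): weighted sum of the three normalized indicators."""
    return w[0] * f + w[1] * r_c + w[2] * r_m
```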
Based on the weight values, the following conclusions can be drawn. First, because VR education is a novel practice of experiential learning, experiential and instructional contents are its essential components; accordingly, the expression score and visual focus rate were relatively more important for the assessment of learning concentration. In addition, although task mastery can assess students' understanding of the learning tasks, it is influenced by their learning and comprehension abilities. Therefore, the weight of task mastery was relatively low to reduce the impact of students' subjective ability factors on the calculation of learning concentration.
2.2 Research of the FER method
The expression score is calculated from the FER results and expression weights. To address the larger adjustment range of head posture in head-mounted VR interaction, we proposed the FERVR framework to improve the accuracy of FER in a VR environment.
2.2.1 FERVR framework
In recent years, the attention mechanism has been applied to achieve better FER results in the presence of partially occluded faces (Abdullah et al. 2019; Maraza et al. 2020). The attention mechanism in FER draws on the selective attention mechanism of human vision; that is, the human eye quickly scans the global image to locate the target region to be focused on (Jiao et al. 2021). More attentional resources are then invested in this focus region to obtain more detailed features, and useless information is suppressed. The attention mechanism in FER divides a global image into several local images and adaptively adjusts the region weights according to the degree of occlusion of each local image. In a VR environment, because the areas occluded by the HMD are relatively fixed, the local regions with higher weights can be determined in advance. Therefore, although the HMD occludes important facial recognition features, such as the eyes, eyebrows, and part of the nose, the mouth area remains clear. If the mouth area is used as the main target area for FER, the influence of HMD occlusion is suppressed.
The images captured by the video devices have a large deflection angle when the rotation angle of the head is wide. Although FER can effectively reduce the influence of HMD blocking by using local features, the recognition rate is reduced when the head deflection angle is wide. By contrast, the global image from the VR environment contains HMD occlusion, local FER feature region, and overall pose feature. If the global and local features are fused, the robustness of the FER network in the presence of a wide head deflection angle can be improved while reducing the effect of HMD occlusion.
Accordingly, by simplifying the attention mechanism, we proposed the FERVR framework by fusing global and local features. An overview of the proposed framework is shown in Fig. 6. First, the face image of the HMD wearer was imported into the FERVR. Second, the input image was divided into a global area containing all facial information and an unobscured local area. Thereafter, both areas were imported into the feature extraction network to obtain global and local features. Finally, after the fusion features were normalized, the facial expression classification results were obtained.
2.2.2 Feature extraction
Extracting more effective expression features improves the FER rate; therefore, the design of the feature extraction network directly influences the final recognition accuracy. The low-dimensional information obtained using conventional feature extraction methods, such as supervised latent Dirichlet allocation (sLDA) (Rajan et al. 2019) and the multi-support vector machine (multi-SVM) (Guo and Zhang 2019), has insufficient expressiveness, resulting in limited recognition capacity. By contrast, CNN-based FER methods, which extract a hierarchy of nonlinear facial features using multiple layers of convolution and pooling, achieve higher accuracy on several facial expression benchmarks. Hence, feature extraction based on deep learning outperforms conventional methods. Accordingly, to enhance the feature extraction effect in a VR environment and improve the accuracy of FER, we developed a novel feature extraction method by optimizing VGG16 (Simonyan and Zisserman 2014).
For feature extraction, a stack of two 3 × 3 convolutional (conv.) layers has an effective receptive field of 5 × 5, and a stack of three 3 × 3 conv. layers has an effective receptive field of 7 × 7. Therefore, using filters with a very small receptive field (3 × 3) reduces the number of hyperparameters of the CNN and leads to better feature extraction results. In addition, too few conv. layers lead to poor performance of the extracted features, whereas too many conv. layers cause overfitting. Thus, following the conv. layer configuration of VGG16 (Simonyan and Zisserman 2014), the architecture of the model used for facial emotion feature extraction in FERVR is shown in Fig. 7, comprising five convolutional blocks and one fully connected (FC) block. First, 3 × 3 conv. layers were used in the convolution blocks. Second, rectified linear units were used as activation functions after each convolution to enhance the nonlinear capability. The channel numbers of the FC layers are relatively low because FERVR is an 8-class model. Through parameter tuning, the channels of FC1, FC2, FC3, and FC4 in the FC block were set to 1024, 512, 256, and 256, respectively; decreasing the number of CNN parameters in this way accelerated network fitting. A dropout layer that deactivates neurons with 50% probability was introduced after FC3 and FC4 to prevent overfitting of the feature extraction network.
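A PyTorch sketch of the FERVR architecture is given below under stated assumptions: the global and local (mouth) branches share one VGG16-style backbone, and fusion is performed by concatenation before the FC block. The FC channel sizes (1024, 512, 256, 256) and the 50% dropout after FC3 and FC4 follow the description above, whereas the shared backbone, pooling, and crop geometry are illustrative choices.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FERVR(nn.Module):
    """Sketch of the global/local fusion framework; not the authors' exact model."""

    def __init__(self, num_classes=8):
        super().__init__()
        # VGG16-style convolutional backbone (five conv blocks of 3x3 filters);
        # grayscale inputs would be replicated to three channels beforehand.
        self.backbone = models.vgg16(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d((2, 2))
        feat = 512 * 2 * 2  # per-branch feature size after pooling
        self.head = nn.Sequential(
            nn.Linear(feat * 2, 1024), nn.ReLU(inplace=True),  # FC1 (fused input)
            nn.Linear(1024, 512), nn.ReLU(inplace=True),       # FC2
            nn.Linear(512, 256), nn.ReLU(inplace=True),        # FC3
            nn.Dropout(0.5),                                   # dropout after FC3
            nn.Linear(256, 256), nn.ReLU(inplace=True),        # FC4
            nn.Dropout(0.5),                                   # dropout after FC4
            nn.Linear(256, num_classes),                       # eight expressions
        )

    def forward(self, global_img, local_img):
        # global_img: full face with HMD; local_img: unobscured mouth crop
        g = self.pool(self.backbone(global_img)).flatten(1)
        l = self.pool(self.backbone(local_img)).flatten(1)
        fused = torch.cat([g, l], dim=1)  # global/local feature fusion
        return self.head(fused)
```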
3 System construction
To verify the usability of the VRLC, evaluation experiments were conducted in this study. Taking safety education as the application scenario, a VR elevator safety education system (VRESE) was developed to capture the expression and interaction data required for the system usability evaluation.
3.1 Composition of the VRESE system
The composition of the VRESE is shown in Fig. 8, including the user interface (UI), virtual interaction, and data acquisition and analysis modules.
3.1.1 UI module
The UI module acts as a bridge for exchanging information between VRESE and students, including learning task prompts, height prompts, and the remaining time. Learning task prompts provide instructions for learning tasks and interactive operations and are triggered sequentially according to the students' operational progress. The height prompt appears after the user enters the virtual elevator scene, providing hints for making self-help judgments based on the current height and enhancing the immersive experience. The remaining time is displayed when the elevator is about to fall, heightening the user's tension by showing a countdown to the crash.
3.1.2 Virtual interaction module
The virtual interaction module of VRESE is responsible for designing human–computer interaction logic to realize real-time interaction in the virtual environment, including operation guidance, teaching, and assessment modes. The operation guidance mode instructs learners to quickly master basic operations such as moving, turning, picking up, and using items in a virtual environment because proper operation guidance can reduce learning costs. In the teaching mode, the instructor’s avatar explains the elevator safety knowledge and guides students to quickly grasp the correct emergency handling methods when the elevator falls. The assessment mode is used to evaluate the students' learning effects. In the assessment mode, interactive operation guidance and learning task reminders are no longer provided by VRESE, and students have to complete the assessment using the elevator safety knowledge they have learned.
3.1.3 Data acquisition and analysis module
Students’ interaction data, including expression data, focused learning time, total time of knowledge explanation, reading completion time of the learning task, total time of the learning task prompt, and assessment score, are acquired in real time by VRESE. Expression data are captured in real time by cameras during students' virtual interaction and used as input data for FERVR. In addition to facial expressions, channels such as language, voice, and context can be used to identify students' learning emotions. However, emotion recognition based on multi-modal data would make the proposed computation of learning concentration more complicated. Therefore, the algorithm design in this work does not yet consider the combined effect of these factors and calculates expression scores only from facial expression data.
In the teaching mode of VRESE, the avatar explains elevator safety knowledge and demonstrates correct avoidance actions for a falling elevator. VRESE uses a ray-based approach to simulate the attention and eyesight of students; the intersection of the ray with the region in which the avatar is located determines whether measurement of the focused learning time begins. Moreover, the reading completion time of the learning task is derived from the time at which the student confirms the learning task prompt in VRESE. The student's score in the assessment mode is set as the assessment score, which is automatically calculated by VRESE according to the assessing standard.
By accumulating the duration of the avatar's skeletal animation and the duration of the teaching audio files, the VRESE system counts the total duration of the knowledge explanation and of the learning task reminder interface after the VR experience is completed. Thereafter, the visual focus rate can be calculated from the focused learning time and the total time of knowledge explanation. Similarly, the reading completion time of the learning task and the total time of the learning task prompt are used to calculate task mastery.
3.2 Setting the facial expression weight
The characteristics of a VR education system must be analyzed in advance because the meaning of each type of facial expression for concentration differs among VR education systems. Subsequently, the weights of the various expressions of the participants in the virtual scenario were determined and used in the calculation of the learning concentration score. By simulating the scene of a malfunctioning elevator sliding down, VRESE enables students to master correct self-rescue skills during an emergency. The external and internal perspectives of the elevator are illustrated in Fig. 9a and b, respectively. Because the relative displacement between the sightseeing elevator and other buildings can visually enhance the participant's tension during the interaction, the observation elevator was chosen as the main interactive scene for safety education training. The observation elevator rises to an altitude of 180 m, equivalent to the height of a 45-story building, and simulates the process of stalling and falling. By creating a sense of falling and weightlessness through the friction between the car and the rails, the VRESE system simulates a situation in which the participant's emotions fluctuate and tests their emergency performance.
Depending on the atmosphere created by the virtual scenario, expressions such as fear, surprise, sadness, and happiness represented high-valence positive affect toward the current activity and were given a high weighting factor. On the contrary, expressions such as disgust, anger, and contempt indicated that students were dissatisfied with the experiential content of the VRESE system and treated the interaction negatively; thus, these expressions were categorized as low-valence negative affect in VRESE. Expressions that cannot clearly reflect students' level of concentration on the experience content were classified as medium-valence neutral affect.
In summary, the facial expression weights for VRESE are listed in Table 2, in which each row lists the expression type, graded valence emotion, and weight. The weight configuration in Table 2 applies only to VRESE; the setting of expression weights in other VR education systems should consider the characteristics of the teaching theme and follow the expression weight configuration rules.
4 Evaluation
To measure the usability and reliability of VRLC, we conducted a usability evaluation experiment using VRESE as an example. The usability evaluation process is illustrated in Fig. 10. First, we evaluated the expression recognition rate of FERVR to ensure the reliability of the expression data. Next, we set the assessing standard according to the learning content of the VR education system; the assessing standard of VRESE was formulated based on the importance of the elevator self-help steps. The participants' expression data were captured by cameras, and the interaction data were collected by VRESE to analyze the learning concentration. After the data were acquired, the learning concentration score was calculated according to the learning concentration formula. If the overall assessment score meets the assessing standard, the evaluation experiment is complete; otherwise, a corresponding learning effect optimization strategy is formulated, and another cycle of experiments is started until the assessment criteria are met.
The valid range of FERVR expression recognition is between −90° and 90°. To avoid losing expression data when the head posture changes beyond this range, two image capture devices were deployed at the experimental site, extending the range from −180° to 180°. The positions of the cameras are shown as solid-line boxes ① and ② in Fig. 11, located in front of and behind the participants, respectively. The resolution of the image capture devices was 1920 × 1080 pixels, and the frame rate was 30 FPS.
4.1 Participants
A total of 103 valid participants, consisting of undergraduate and graduate students recruited from a technical university in China, were enrolled in this experiment. Informed consent was obtained from all voluntary participants prior to the start of the experiment. Basic information such as gender, age, and prior VR knowledge is summarized in Table 3. There were 55 men and 48 women, a male-to-female ratio of 1.15:1; 82 participants were aged 18–25 (79.61%), and the remaining 20.39% were aged 26–30. Participants with prior knowledge of VR interaction accounted for 44.66%.
The participants were divided into 12 groups, with no more than 10 participants in each group. Each participant was allowed to wear an HMD device and experience VRESE using two controllers. Expression and interaction data were collected to examine the usability of the VRLC throughout the participants' entire learning process. In addition, each participant was asked to use the VRESE for no more than 15 min, and it took 20 days to complete data collection.
4.2 Expression recognition rate of FERVR
Before conducting the instance validation, the accuracy of FERVR must be checked because it determines the data reliability of VRLC. This check comprises dataset preprocessing and network training.
4.2.1 Dataset preprocessing
Images collected from various sources cannot fully meet the experimental requirements owing to limitations of image size, color, orientation, or angle. Therefore, preprocessing operations, including face detection, segmentation, graying, and normalization, are necessary to eliminate the interference of non-expression features and make the expression images as clear and informative as possible.
Because few datasets for facial recognition in VR environments have been published, existing datasets must be processed to meet the requirements. As shown in Fig. 12a, RaFD (Langner et al. 2010), with 8040 facial expression images, contains 67 performers of different ages, genders, and skin tones in eight types of expressions (happiness, sadness, disgust, surprise, fear, anger, contempt, and expressionless) and five types of postures (−90°, −45°, 0°, 45°, and 90°), which can meet the training requirements for head pose robustness. However, the facial images from this dataset are not directly applicable to FER in VR scenarios because the influence of HMD occlusion on face recognition has not been considered. To solve this problem, an HMD occlusion mask was added to the images from RaFD during data preprocessing. In addition, the pitch angle of the head in the RaFD pictures is 0°, and we set a pitch angle of 0° as a qualifying condition for using FERVR.
The dataset preprocessing steps are shown in Fig. 13, including face detection, HMD occlusion simulation, graying, and normalization; a code sketch of these steps follows the list below.
1. Face detection: facial detection is performed on the images from the dataset, and the face region is preserved to remove irrelevant background information from the FER features.
2. HMD occlusion simulation: as shown in Fig. 12b, the dataset suitable for VR interaction (SRaFD) is generated by adding HMD occlusion masks of different angles to the eye positions of the images from RaFD.
3. Graying: to reduce the system overhead, images from the SRaFD are grayed to improve network training efficiency.
4. Normalization: in the last step, the gray images are normalized to obtain face images of the same scale.
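A minimal OpenCV sketch of these four steps is shown below; the Haar cascade detector and the rectangular mask geometry are illustrative stand-ins, as the actual SRaFD masks model the HMD shape at several angles.

```python
import cv2
import numpy as np

def preprocess(img_bgr: np.ndarray, size: int = 224) -> np.ndarray:
    """Steps 1-4: face detection, HMD occlusion simulation, graying, normalization."""
    # Step 1: detect and crop the face region (Haar cascade as a simple detector)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        raise ValueError("no face found")
    x, y, w, h = faces[0]
    face = img_bgr[y:y + h, x:x + w]

    # Step 2: simulate HMD occlusion over the eye region
    # (an approximate eye band; real masks follow the HMD shape and pose)
    mh0, mh1 = int(0.20 * h), int(0.55 * h)
    face[mh0:mh1, :] = 0  # black HMD mask

    # Step 3: grayscale to reduce training overhead
    face_gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)

    # Step 4: rescale to a common size and normalize to [0, 1]
    face_gray = cv2.resize(face_gray, (size, size))
    return face_gray.astype(np.float32) / 255.0
```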
4.2.2 Network training
Network training of FERVR was performed after dataset preprocessing. First, data augmentation by random cropping and filling with small-scale occlusion masks was used to expand the number of images in the dataset. Thereafter, the dataset was stochastically divided into training and testing sets using a Python data-splitting script to reduce the influence of human factors on dataset partitioning. Finally, the processed data were fed into FERVR for training. Network initialization used default parameters; the learning rate was set to 0.001, the momentum to 0.9, and stochastic gradient descent (SGD) was used as the optimizer. After all network parameters were set, FERVR was trained for 200 epochs.
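A minimal PyTorch training loop matching these settings (SGD, learning rate 0.001, momentum 0.9, 200 epochs) might look as follows; the batch size and the dataset interface yielding global/local crops with labels are assumptions, and FERVR refers to the model sketch given earlier.

```python
import torch
from torch.utils.data import DataLoader

model = FERVR()  # network from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
# train_set: preprocessed SRaFD training split (assumed to yield
# (global_img, local_img, label) tuples); batch size is our choice.
loader = DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(200):  # 200 training epochs, as described above
    for global_img, local_img, label in loader:
        optimizer.zero_grad()
        loss = criterion(model(global_img, local_img), label)
        loss.backward()
        optimizer.step()  # SGD step with lr=0.001, momentum=0.9
```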
The expression recognition accuracy of FERVR should be evaluated in real time during network training, and the hyper-parameters can be optimized iteratively based on the evaluation results to improve recognition accuracy. One of the metrics used to evaluate recognition effectiveness is the loss value (Checa and Bustillo 2020), which measures the degree of error between the output of FERVR and the ground truth of the dataset. The lower the loss value, the better the robustness of the method.
The change curves of the loss value and FER accuracy during FERVR training are shown in Figs. 14 and 15, respectively. Initially, the loss value was above 2.00; it then gradually decreased as the epochs increased and stabilized at approximately 0.1 after 150 epochs. Conversely, the accuracy rate was only 0.2 early in training and increased gradually with the epochs. After 150 epochs, the curve became stable; correspondingly, the accuracy rate was high once the loss value became small.
The confusion matrix (Mohammed and Al-Ani 2020) allows recognition performance to be examined for each label (in our case, each emotion). Using the confusion matrix, the recognition rate of each emotion can be analyzed, and the easiest and most difficult emotions to recognize can be identified. The diagonal of the confusion matrix represents the average accuracy for each class of expressions, and the remaining entries indicate confusion with other expressions. A comparison of the confusion matrices is presented in Fig. 16; Fig. 16a and b shows the confusion matrices for the eight emotions of RaFD and SRaFD, respectively. The horizontal axis in Fig. 16 indicates the class predicted by FERVR, the vertical axis represents the true class of RaFD or SRaFD, and the color depth indicates the degree of FER accuracy.
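Given ground-truth labels and FERVR predictions on the test split (assumed here to be available as arrays y_true and y_pred), the per-class accuracies on the diagonal of a row-normalized confusion matrix can be computed as follows:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true: ground-truth labels of the test split; y_pred: FERVR predictions;
# both are integer-coded over the eight expression classes.
cm = confusion_matrix(y_true, y_pred, labels=list(range(8)))
cm_norm = cm / cm.sum(axis=1, keepdims=True)  # row-normalized, as in Fig. 16
per_class_acc = np.diag(cm_norm)              # diagonal: per-expression accuracy
mean_acc = per_class_acc.mean()               # average FER accuracy
```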
As shown in Fig. 16a, the accuracy rates for recognizing fear, disgust, anger, happiness, expressionless, and contempt were all above 0.92. The facial expression characteristics of sadness and disgust are occasionally similar, and 14% of sadness instances were classified as disgust; therefore, the accuracy for sadness was relatively low. Similarly, 5% of surprise instances were predicted as happiness; thus, the recognition rate for surprise was 0.86. Accordingly, the average accuracy of FERVR on RaFD reached 92.75%. In addition, as shown in Fig. 16b, the accuracy rates for anger, happiness, expressionless, and contempt were above 0.90. In particular, the accuracy rate for expressionless reached 0.99, and those for fear, disgust, sadness, and surprise were approximately 0.85. In summary, the average accuracy of FERVR on SRaFD reached 0.9004, which is 20.65% higher than that of LeNet-based FER for VR interaction.
Comparing Fig. 16a and b shows that the overall FER rate decreased by 2.71% in the VR interaction environment because the features of the eyes, eyebrows, and parts of the nose were obscured by the HMD. In particular, the recognition rates of fear and disgust decreased by approximately 10% because, without these key features, fear was confused with sadness and surprise. The probabilities of judging fear as sadness and surprise were 6% and 5%, respectively, and the probability of judging disgust as sadness was 9%.
The multi-angle expression recognition accuracy results for RaFD are summarized in Table 4. In multi-angle FER, 0° denotes no deflection, and −45° and −90° (45° and 90°) denote 45° and 90° head deflection to the left (right), respectively. The average accuracy for the eight types of expressions at 0° was above 0.95, indicating that FERVR has an ideal recognition effect for various facial expressions. Among them, the accuracy rates for anger, happiness, and expressionless reached 0.99, as the frontal direction provides the richest expression features. When the head deflection angle reached 45°, most of the effective facial features could still be extracted; compared with a 0° head deflection, the FER effect decreased slightly. When the head deflection angle reached 90° or −90°, the recognition accuracy of all expression types was significantly reduced owing to the sharp reduction in effective expression information; in particular, the recognition accuracy of fear, disgust, sadness, and surprise decreased by approximately 10%.
The results of expression recognition using deep learning methods comprise true positives, true negatives, false positives, and false negatives (Chicco et al. 2021). Elements that the algorithm correctly identifies as positive are called true positives, whereas those wrongly classified as negative are false negatives. Conversely, negative elements that are correctly labeled negative are true negatives, whereas those wrongly predicted as positive are false positives (Chicco et al. 2021). Therefore, the results of expression recognition influence the expression score.
First, a true positive or true negative recognition result indicates that the result is correct. Second, if the expressions involved in false positives or false negatives belong to the same graded valence emotion, they have no effect on the overall expression score. Finally, as shown in Table 5, if the expressions involved in false positives or false negatives do not belong to the same graded valence emotion, the change tendency of the expression score is determined by the graded valence emotions and the recognition results.
The accuracy comparison of different FER methods on RaFD is presented in Table 6. The conventional FER methods sLDA and multi-SVM have significant shortcomings in real-time performance, accuracy, and robustness. Many researchers have used CNN algorithms for FER (Liong 2020; Zhang et al. 2019; You et al. 2020), enabling computers to read the meanings expressed in face images more quickly and accurately. DenseNet121, ResNet50, VGG16, and VGG19 are deep learning-based FER methods that can overcome the shortcomings of conventional expression recognition methods. Deep CNN models can effectively extract features from data, beyond the capability of many machine learning recognition algorithms.
However, if these methods are not adapted to the characteristics of the recognized objects, overfitting may occur because of the large number of convolutional layers. As presented in Table 6, the average accuracy of FERVR on RaFD was 92.75%. Compared with sLDA and multi-SVM, the average FER rate of FERVR on this dataset improved by 29.45% and 26.62%, respectively. Compared with DenseNet121, ResNet50, and VGG16, the average accuracy of FERVR on RaFD improved by 5.03% to 25.21%.
4.3 Assessing standard setting
The interaction design of VRESE referenced the self-rescue steps for an elevator fall proposed by Shi et al. (2019). First, when an elevator falls, participants should quickly press as many floor buttons as possible to improve the chance of stopping the elevator. Second, the participant's back should be kept close to the elevator wall to protect the spine. Finally, participants should cover their heads with both hands, bend their knees, and point their toes to the ground to relieve the impact of the fall.
The assessing standard of VRESE, which directly evaluates the learning effect of participants, is summarized in Table 7; the participant's interactive behavior was key to this assessment. First, based on the importance of each self-rescue step, participants should press at least seven floor buttons. Second, participants should stand in the safe area of the elevator. Additionally, at least one protective action must be taken. The full VRESE score was set to 100, and the approval threshold was 85 points.
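A sketch of how such an assessment could be scored automatically is shown below; only the three criteria and the 85-point threshold come from Table 7, whereas the 40/30/30 point allocation is a hypothetical choice for illustration.

```python
def assessment_score(buttons_pressed: int, in_safe_area: bool,
                     protective_actions: int) -> int:
    """Illustrative scoring against the Table 7 criteria; the point split
    is a hypothetical allocation, not the paper's actual standard."""
    score = 0
    if buttons_pressed >= 7:     # press at least seven floor buttons
        score += 40
    if in_safe_area:             # stand in the safe area of the elevator
        score += 30
    if protective_actions >= 1:  # take at least one protective action
        score += 30
    return score

passed = assessment_score(8, True, 1) >= 85  # approval threshold: 85/100
```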
4.4 Results analysis
The usability evaluation results of VRLC are presented in Fig. 17. The average learning concentration score was 0.63, and the average values of the expression score, visual focus rate, and task mastery were 0.69, 0.62, and 0.57, respectively. The expression score was relatively higher than the visual focus rate and task mastery, indicating that participants were attracted by the experience content of VRESE; thus, high-valence positive affect appeared more frequently than low-valence negative affect. However, the low visual focus rate shows that participants spent insufficient time focusing on the learning content in the teaching mode; therefore, some key knowledge points were neglected. Similarly, the low task mastery means that participants had an insufficient understanding of the VRESE learning tasks, leading to cognitive errors in the follow-up operations.
The distribution of the learning concentration and assessment scores is shown in Fig. 18. Thirty-four participants met the assessing standards, and their average learning concentration score reached 0.80. By contrast, the assessment scores of participants whose learning concentration scores were below 0.6 were clearly low. However, although some participants had high learning concentration scores, their assessment scores were low. For instance, the participant with ID 16 had a learning concentration score of 0.88, yet an assessment score of only 65, which is markedly inconsistent with the expectation that higher learning concentration leads to a higher assessment score. We concluded that this participant was unable to complete the tasks in the assessment section because they did not fully understand the operation of the system. Owing to the overall low learning concentration, the average score on the VRESE system was only 70.10, indicating that the results of the first usability assessment did not meet the assessing standards. As for the few abnormal individual results, some participants had not fully mastered the operation method of the VRESE system and therefore obtained low assessment results. In summary, deploying optimization strategies based on the analysis of the learning concentration scores can improve the learning effect and interaction experience.
The following two learning effect optimization strategies were formulated by analyzing the results of the learning concentration evaluation:
4.4.1 Psychological counseling
For participants with low concentration scores, and drawing on cognitive comprehension and cognitive behavior therapies (Sarioglan 2020), we guided them to realize the importance of elevator safety self-help knowledge by playing videos of elevator accidents. As a result, the learning state of these participants was adjusted, and they actively engaged in learning elevator safety knowledge.
4.4.2 Optimization of the system interaction design
In response to the problems identified in the interaction design, the scene realism and interaction mechanisms of the VRESE system were comprehensively optimized as follows:
1. To alleviate dizziness, the movement mode in the virtual scene was changed from walking to curve blinking;
2. On the premise of a smooth picture, rendering filters were added to the camera in the virtual scene to enhance its realism;
-
(3)
A new scene that can be experienced multiple times was created for learning the system operation method to ensure that the basic interactive operation was mastered by participants. Furthermore, by reducing the size of the confirmation button on the task reminder prompt, the problem of the task reminder prompt being easily closed by incorrect manipulation was solved. Thus, the completion time of task reminder reading was counted more accurately;
-
(4)
When the avatar explained the knowledge points of elevator safety, a number of 3D arrows were added to guide participants to turn their attention to the presentation area of the knowledge points. Therefore, the visual focus rate increased.
Using the same evaluation process as the first test, the second usability evaluation experiment was conducted over 32 days after the learning effect optimization strategies had been implemented. The results are presented in Fig. 19. Overall, the average values of the expression score, visual focus rate, and task mastery were 0.79, 0.83, and 0.85, respectively. The average expression score of 0.79 shows that the proportion of high-valence positive affect was significantly higher than that of low-valence negative affect: participants were satisfied with the experience content and, consequently, immersed in the elevator scene created by VRESE. The average visual focus rate of 0.83 indicates that, on average, 83% of participants' attention was focused on the presentation area of the knowledge points, so the focused learning time was high. The average task mastery score reached 0.85; that is, participants read 85% of the learning task reminder prompts and could complete the correct self-rescue operations in the follow-up tasks. The distribution of the learning concentration and assessment scores in the second experiment is shown in Fig. 20. Seventy-five participants met the assessment standard, and their average learning concentration score reached 0.86. As more attention was focused on the learning content of VRESE, the overall average learning concentration score reached 0.81. As a result, the average assessment score reached 85.49 and met the VRESE assessment standard; the usability evaluation experiment was therefore concluded.
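For illustration, the visual focus rate underlying these numbers can be thought of as the fraction of gaze samples that land inside the knowledge presentation area. The sketch below assumes gaze points already projected to 2D coordinates on a virtual panel and a rectangular focus area; a real HMD pipeline would more likely cast gaze rays against the scene geometry.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) on the panel

def visual_focus_rate(gaze_points: Iterable[Tuple[float, float]],
                      focus_area: Rect) -> float:
    """Fraction of gaze samples falling inside the presentation area."""
    x0, y0, x1, y1 = focus_area
    samples = list(gaze_points)
    if not samples:
        return 0.0
    hits = sum(1 for x, y in samples if x0 <= x <= x1 and y0 <= y <= y1)
    return hits / len(samples)

# Hypothetical usage: ten gaze samples, eight inside the area -> 0.8
print(visual_focus_rate([(0.5, 0.5)] * 8 + [(2.0, 2.0)] * 2, (0.0, 0.0, 1.0, 1.0)))
```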
5 Discussion
As mentioned earlier, the learning concentration and examination results of the participants improved significantly. A comparison of the two experiments is presented in Fig. 21a. Compared with the first experiment, the average expression score, visual focus rate, and task mastery in the second experiment increased by 10, 21, and 28 percentage points, respectively. Thus, the participants' concentration on the learning content of VRESE was significantly improved by implementing the learning effect optimization strategies. Accordingly, the average learning concentration score in the second experiment was 0.81, an increase of 18 percentage points over the first experiment. The average assessment scores of the two experiments are shown in Fig. 21b. The average assessment score of the second experiment reached 85.49, which is 15.39 points higher than that of the first experiment. Consequently, in head-mounted VR interaction, learning concentration is an essential factor for improving the learning effect. Moreover, optimization strategies formulated according to the learning concentration score can effectively improve participants' learning effects.
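For clarity, these gains are the absolute differences between the averages reported for the two experiments:

```latex
\[
\begin{aligned}
\Delta_{\text{expression}}    &= 0.79 - 0.69 = 0.10, &
\Delta_{\text{focus}}         &= 0.83 - 0.62 = 0.21, &
\Delta_{\text{mastery}}       &= 0.85 - 0.57 = 0.28, \\
\Delta_{\text{concentration}} &= 0.81 - 0.63 = 0.18, &
\Delta_{\text{score}}         &= 85.49 - 70.10 = 15.39. &&
\end{aligned}
\]
```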
In this study, we proposed VRLC, a learning concentration evaluation method that uses the expression score, visual focus rate, and task mastery as evaluation indicators, to improve the learning effect and interactive experience of virtual interaction. The results of this study provide a useful reference for VR education researchers and VR system developers.
For these researchers and developers, optimizing the interactive experience is a problem that must be addressed. VRLC helps analyze students' learning behavior during VR interaction and provides objective data for optimizing the interactive experience. First, the indicator weights of VRLC can be adjusted according to the teaching content and characteristics of a VR education system, allowing the method to fit most VR education topics. In addition, the evaluation indicators of VRLC can quickly locate deficiencies in the interaction design of a VR education system, and optimization strategies formulated from the learning concentration analysis can improve the learning effect and interactive experience.
6 Conclusion
Because conventional concentration analysis methods rely on the head-up rate and FER results as evaluation metrics, they are not suitable for VR interaction. Solutions based on sensing devices and interaction data have been proposed to analyze learning concentration, but both have drawbacks: sensing-device-based methods are uncomfortable for users and have relatively low recognition rates, whereas methods based on interaction data target specific evaluation populations and their evaluation indicators scale poorly.
To promote the learning effect and interactive experience, we proposed the VRLC method to solve the above problems. Depending on the characteristics of different types of VR education applications, the learning concentration scores of students in head-mounted VR interaction can be calculated by adjusting the indicator weights. The expression score, visual focus rate, and task mastery were used as the evaluation indicators of learning concentration. The expression score effectively evaluates students' affective response to the VR learning content, whereas the visual focus rate and task mastery evaluate the proportion of time focused on learning and the degree of knowledge mastery, respectively. The resulting analysis of virtual learning concentration provides a comprehensive and objective basis for formulating learning effect optimization strategies.
The results of the evaluation experiments showed that learning concentration can be effectively estimated. By formulating and implementing the corresponding learning effect optimization strategies, the learning concentration of students increased by 18 percentage points. This substantial increase reflects that students could better immerse themselves in the interactive scenes, focus on the learning content, and master the knowledge delivered by VRESE. Accordingly, the average assessment score improved by 15.39 points. The experimental results indicate that learning concentration is an essential factor in measuring the VR learning effect, and that the learning effect and interactive experience can be effectively improved by formulating and implementing corresponding optimization strategies.
To obtain the FER data required for calculating the learning concentration score, we proposed FERVR for head-mounted VR interaction. By simplifying the attention mechanism, FERVR reduces the influence of HMD occlusion on FER results. In addition, by fusing global and local features, it remains robust over a larger range of head postures. The FER evaluation results indicated that the recognition rate of FERVR on RaFD reached 92.75%. Compared with the conventional FER methods sLDA and multi-SVM, FERVR achieved 29.45% and 26.62% higher average accuracy, respectively. Compared with deep learning-based FER methods, including DenseNet121, ResNet50, and VGG16, FERVR achieved 5.03% to 25.21% higher average accuracy. In addition, the average accuracy of FERVR on SRaFD was 90.04%, which is 20.65% higher than that of a LeNet-based FER method for VR. Consequently, the data reliability of VRLC was significantly enhanced.
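The global-local feature fusion that FERVR relies on can be illustrated schematically. The following PyTorch sketch shows only the general idea of combining a whole-face branch with a branch over the unoccluded lower face; the tiny backbones, concatenation-based fusion, and seven-class output are our own assumptions, and the sketch omits FERVR's simplified attention mechanism rather than reproducing the authors' architecture.

```python
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    """Schematic two-branch FER model: global face plus local unoccluded region."""

    def __init__(self, num_classes: int = 7):
        super().__init__()
        # Global branch: the whole (partially HMD-occluded) face image.
        self.global_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Local branch: a crop of the unoccluded lower face (mouth and chin).
        self.local_branch = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16 + 16, num_classes)

    def forward(self, face: torch.Tensor, lower_face: torch.Tensor) -> torch.Tensor:
        # Fuse the two feature vectors by concatenation, then classify.
        fused = torch.cat([self.global_branch(face), self.local_branch(lower_face)], dim=1)
        return self.classifier(fused)

model = GlobalLocalFusion()
logits = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 7])
```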
Currently, VRLC can be further improved in several respects. First, FERVR relies on local features that are not occluded by the HMD to recognize facial expressions. When users bow their heads, their facial features may be completely occluded by the HMD; in this case, expressions are not accurately recognized by FERVR, and some of the students' expression data are lost. The effect of pitch angle on expression recognition will therefore be addressed in the next step of our research. Second, the calculation of learning concentration should also consider the joint effects of language, voice, and context in recognizing learning emotions; constructing a multi-modal emotion recognition model can effectively improve the reliability of the calculation results. Finally, during the usability evaluation, we found that learning concentration was also affected by hardware factors. For instance, some students could not adapt to the dizziness caused by prolonged virtual interaction, resulting in a decline in learning concentration. Accordingly, our next research goal is to determine whether hardware factors should be included as an evaluation indicator of learning concentration and to quantify their impact on concentration in virtual learning.
Availability of data and materials
All data generated or analyzed during this study are included in this published article.
Code availability
Code generated or used during the study is available from the corresponding author by request.
References
Abdullah J, Mohd-Isa WN, Samsudin MA (2019) Virtual reality to improve group work skill and self-directed learning in problem-based learning narratives. Virtual Real 23(4):461–471. https://doi.org/10.1007/s10055-019-00381-1
Ainley M (2012) Students' interest and engagement in classroom activities. In: Handbook of research on student engagement. Springer, pp 283–302. https://doi.org/10.1007/978-1-4614-2018-7_13
Alemdag E, Cagiltay K (2018) A systematic review of eye tracking research on multimedia learning. Comput Educ 125:413–428. https://doi.org/10.1016/j.compedu.2018.06.023
Arana-Llanes JY, Gabriel GS, Rodrigo PT et al (2018) EEG lecture on recommended activities for the induction of attention and concentration mental states on e-learning students. J Intell Fuzzy Syst 34(5):3359–3371. https://doi.org/10.3233/JIFS-169517
Arya R, Singh J, Kumar A (2021) A survey of multidisciplinary domains contributing to affective computing. Comput Sci Rev 40(3):100399. https://doi.org/10.1016/j.cosrev.2021.100399
Boutefara T, Mahdaoui L (2020) Using holonic multi-agent architecture to deal with complexity in multi-modal emotion recognition. In: 2020 International Conference on Advanced Aspects of Software Engineering (ICAASE).
Castelló A, Chavez D, Cladellas R (2020) Association between slides-format and Major’s contents: effects on perceived attention and significant learning. Multimedia Tools Appl 79(33):24969–24992
Checa D, Bustillo A (2020) A review of immersive virtual reality serious games to enhance learning and training. Multimedia Tools Appl 79(9):5501–5527. https://doi.org/10.1007/s11042-019-08348-9
Chi MTH, Wylie R (2014) The ICAP framework: Linking cognitive engagement to active learning outcomes. Educ Psychol 49(4):219–243. https://doi.org/10.1080/00461520.2014.965823
Chicco D, Tötsch N, Jurman G (2021) The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min. https://doi.org/10.1186/s13040-021-00244-z
Christenson SL, Reschly AL, Wylie C (2012) The relations of adolescent student engagement with troubling and high-risk behaviors. In: Handbook of research on student engagement. Springer, pp 563–584. https://doi.org/10.1007/978-1-4614-2018-7_27
D’Mello S, Graesser A (2012) Dynamics of affective states during complex learning. Learn Instr 22(2):145–157
D'Mello S (2017) Emotional learning analytics. In: Lang C, Siemens G, Wise AF, Gašević D (eds) Handbook of learning analytics. Society for Learning Analytics Research, pp 115–127
Deci EL, Ryan RM (1985) Intrinsic motivation and self-determination in human behavior. Plenum. https://doi.org/10.2307/2070638
Fredricks JA, Mccolskey W (2012) The measurement of student engagement: a comparative analysis of various methods and student self-report instruments. Springer, US
Fredricks JA, Blumenfeld PC, Paris A (2004) School engagement: Potential of the concept: State of the evidence. Rev Educ Res 74:59–119. https://doi.org/10.3102/00346543074001059
Greene BA (2015) Measuring cognitive engagement with self-report scales: reflections from over 20 years of research. Educ Psychol 50(1):14–30
Guo G, Zhang N (2019) A survey on deep learning based face recognition. Comput vis Image Underst 189:102805. https://doi.org/10.1016/j.cviu.2019.102805
Guo X, Zhou J, Xu T (2018) Evaluation of teaching effectiveness based on classroom micro-expression recognition. Int J Perform Eng 14(11):2877–2885
Gupta A, Elby A, Danielak BA (2018) Exploring the entanglement of personal epistemologies and emotions in students’ thinking. Phys Rev Phys Educ Res 14(1):010129. https://doi.org/10.1103/PhysRevPhysEducRes.14.010129
Guthrie JT, Wigfield A, Barbosa P, Perencevich KC, Tonks S (2004) Increasing reading comprehension and engagement through concept-oriented reading instruction. J Educ Psychol 96(3):403–423
Jiao P, Guo X, Jing X et al (2021) Temporal network embedding for link prediction via VAE joint attention mechanism. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2021.3084957
Käser D, Parker E, Glazier A et al (2017) The making of Google Earth VR. In: ACM SIGGRAPH 2017 Talks, Los Angeles, California. ACM
Kht A, Jch A, Crt B, Czl C, Yhh D (2022) Virtual reality for car-detailing skill development: learning outcomes of procedural accuracy and performance quality predicted by VR self-efficacy, VR using anxiety, VR learning interest and flow experience
Kim YSG, Schatschneider C (2017) Expanding the developmental models of writing: a direct and indirect effects model of developmental writing (DIEW). J Educ Psychol 109(1):35–50. https://doi.org/10.1037/edu0000129
Kim J, Merrill K, Xu K et al (2020) My teacher is a machine: understanding students’ perceptions of ai teaching assistants in online education. Int J Hum Comput Interact 36(20):1902–1911
Krejtz K, Duchowski AT, Krejtz I, Kopacz A, Chrząstowski-Wachtel P (2016) Gaze transitions when learning with multimedia. J Eye Mov Res. https://doi.org/10.16910/jemr.9.1.5
Kruger J-L, Doherty S (2016) Measuring cognitive load in the presence of educational video: Towards a multimodal methodology. Aust J Educ Technol. https://doi.org/10.14742/ajet.3084
Langner O, Dotsch R, Bijlstra G et al (2010) Presentation and validation of the radboud faces database. Cogn Emot 24(8):1377–1388
Li Y, Zeng J, Shan S et al (2019) Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Trans Image Process 28(5):2439–2450. https://doi.org/10.1109/TIP.2018.2886767
Liong ST, Gan YS, Zheng D et al (2020) Evaluation of the spatio-temporal features and gan for micro-expression recognition system. J Signal Process Syst 92(7):705–725. https://doi.org/10.1007/s11265-020-01523-4
Liu Z, Wang Z (2017) The empirical study of behavior engagement influence on deep learning: exemplified with video learning in virtual reality (VR) environment. https://doi.org/10.15881/j.cnki.cn33-1304/g4.2017.01.008
Liu H-C, Chuang H-H (2011) An examination of cognitive processing of multimedia information based on viewers’ eye movements. Interact Learn Environ 19(5):503–517. https://doi.org/10.1080/10494820903520123
Mahmoudi MA, Chetouani A, Boufera F et al (2020) Learnable pooling weights for facial expression recognition. Pattern Recogn Lett 138:644–650. https://doi.org/10.1016/j.patrec.2020.09.001
Maraza QB, Alejandro OOM, Choquehuanca QW et al (2020) Towards a standardization of learning behavior indicators in virtual environments. Int J Adv Comput Sci Appl 11(11):146–152. https://doi.org/10.14569/IJACSA.2020.0111119
Meece J, Blumenfeld PC, Hoyle RH (1988) Students’ goal orientation and cognitive engagement in classroom activities. J Educ Psychol 80:514–523
Mitruţ O, Moise G, Petrescu L, Moldoveanu A, Leordeanu M, Moldoveanu F (2019) Emotion classification based on biophysical signals and machine learning techniques. Symmetry 12(1):21
Mohammed BA, Al-Ani MS (2020) An efficient approach to diagnose brain tumors through deep CNN. Math Biosci Eng MBE 18:851–867. https://doi.org/10.3934/mbe.2021045
Parong J, Mayer RE (2021) Cognitive and affective processes for learning science in immersive virtual reality. J Comput Assist Learn 37(1):226–241
Pedram S, Palmisano S, Skarbez R et al (2020) Investigating the process of mine rescuers’ safety training with immersive virtual reality: A structural equation modelling approach. Comput Educ. https://doi.org/10.1016/j.compedu.2020.103891
Pekrun R, Perry RP (2014) Control-value theory of achievement emotions. In: International handbook of emotions in education. Routledge, pp 120–141. https://doi.org/10.4324/9780203148211.ch7
Pekrun R, Linnenbrink-Garcia L (2012) Academic emotions and student engagement. In: Christenson SL, Reschly AL, Wylie C (eds) Handbook of research on student engagement. Springer, pp 259–282
Qi M, Wang Y, Qin J et al (2020) stagNet: An attentive semantic RNN for group activity and individual action recognition. IEEE Trans Circuits Syst Video Technol 30(2):549–565. https://doi.org/10.1109/TCSVT.2019.2894161
Rajan S, Chenniappan P, Devaraj S, Madian N (2019) Facial expression recognition techniques: a comprehensive survey. IET Image Process 13(7):1031–1040. https://doi.org/10.1049/iet-ipr.2018.6647
Renninger KA, Hidi S (2016) The power of interest for motivation and engagement. Routledge. https://doi.org/10.4324/9781315771045
Russell JA, Barrett LF (1999) Core affect, prototypical emotional episodes, and other things called emotion: Dissecting the elephant. J Personal Soc Psychol 76(5):805–819
Sarioglan ABI (2020) Investigated effects of guided inquiry-based learning approach on students’ conceptual change and durability. Cypriot J Educ Sci 15(4):674–685
Shen CW, Ho JT, Ly P et al (2019) Behavioural intentions of using virtual reality in learning: perspectives of acceptance of information technology and learning style. Virtual Real 23(3):313–324. https://doi.org/10.1007/s10055-018-0348-1
Shete PC, Ansari ZN, Kant R (2020) A Pythagorean fuzzy AHP approach and its application to evaluate the enablers of sustainable supply chain innovation. Sustain Prod Consum 23:77–93
Shi G, Li G, Zhu Z et al (2019) A virtual experiment for partial space elevator using a novel high-fidelity FE model. Nonlinear Dyn 95(4):2717–2727
Shi Y (2020) Research on evaluation model of classroom attention of students based on face recognition technology. Dissertation, Central China Normal University
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Sinatra GM, Heddy BC, Lombardi D (2015) The challenges of defining and measuring student engagement in science. Educ Psychol 50(1):1–13
Skinner E (2016) Handbook of motivation at school. Routledge
Skinner E, Furrer C, Marchand G, Kindermann T (2008) Engagement and disaffection in the classroom: Part of a larger motivational dynamic? J Educ Psychol 100(4):765–781. https://doi.org/10.1037/a0012840
Suhaimi NS, Mountstephens J, Teo J (2020) Parameter tuning for enhancing inter-subject emotion classification in four classes for vr-eeg predictive analytics. Int J Adv Sci Technol 29(6):1483–1491
Sutjarittham T, Gharakheili HH, Kanhere SS et al (2019) Experiences with IoT and AI in a smart campus for optimizing classroom usage. IEEE Internet Things J 6(5):7595–7607. https://doi.org/10.1109/JIOT.2019.2902410
Teng T (2017) Facial expressions recognition based on convolutional neural networks for mobile virtual reality. Dissertation, Shanghai Jiao Tong University
Thomas AW, Molter F, Krajbich I, Heekeren HR, Mohr PN (2019) Gaze bias differences capture individual choice behaviour. Nat Hum Behav 3(6):625–635. https://doi.org/10.1101/228825
Tsai CW, Shen PD, Chiang IC (2020) Investigating the effects of ubiquitous self-organized learning and learners-as-designers to improve students’ learning performance, academic motivation, and engagement in a cloud course. Univ Access Inf Soc 19(1):1–16. https://doi.org/10.1007/s10209-018-0614-8
Wong YK (2018) Exploring the reading-writing relationship in young Chinese language learners’ sentence writing. Read Writ 31:945–964
Wu T (2019) Expression recognition based on the restoration of occluded face images in VR scenarios. Dissertation, South China University of Technology
Yeh SC, Lin SY, Wu HK et al (2020) A virtual-reality system integrated with neuro-behavior sensing for attention-deficit/hyperactivity disorder intelligent assessment. IEEE Trans Neural Syst Rehabil Eng 28(9):1899–1907. https://doi.org/10.1109/TNSRE.2020.3004545
You M, Han X, Xu Y et al (2020) Systematic evaluation of deep face recognition methods. Neurocomputing 388:144–156
Zangeneh Soroush M, Maghooli K, Setarehdan SK, Nasrabadi AM (2018) A novel approach to emotion recognition using local subset feature selection and modified Dempster-Shafer theory. Behav Brain Funct. https://doi.org/10.1186/s12993-018-0149-4
Zhang FF, Zhang TZ, Mao QR (2019) Multi-pose facial expression recognition via generative adversarial network. Chin J Comput 42(120):1–16
Funding
This work was funded by the Program of Study Abroad for Young Scholars sponsored by the China Scholarship Council (CSC) (201806655029) and the Educational Research Project for Young Teachers of the Education Department of Fujian Province, China (JAT200029).
Author information
Contributions
YL and YL contributed to conceptualization, methodology, formal analysis and investigation, writing—original draft preparation, and writing—review and editing. YL involved in funding acquisition. YL and SW contributed to resources. YL involved in supervision.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Consent to participate
Written informed consent was obtained from all individual participants or their guardians.
Consent for publication
Written informed consent for publication was obtained from all participants.
Ethics approval
All procedures performed in this study involving human participants were in accordance with the ethical standards of the institutional and/or national research committee, and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors. Informed consent was acquired from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.