1 Introduction
Collaborative learning is a social system in which groups of learners solve problems or construct knowledge by working together [4]. It is a vital skill in today’s knowledge-driven society and a widely embraced educational approach [25]. Understanding and studying interactions in collaborative learning has attracted significant interest across diverse disciplines [20, 64, 75] because it enables us to improve teaching methods, learning outcomes, and the overall educational experience for students. Moreover, collaborative learning is also adopted in professional settings such as business environments, corporate training programs, and research endeavors. Achieving success in collaborative learning therefore holds significant importance.
Recent research underscores the crucial role of socially shared regulation in learning (SSRL) as a determinant of successful collaborative learning [27]. SSRL involves regulating group behavior and refining shared understanding through negotiation; it is integral to effective collaborative learning, shaping group dynamics, and enhancing shared knowledge construction [27]. The focal point of SSRL lies in the interplay between cognitive and socio-emotional interactions, recognized as the primary mechanisms facilitating group regulation in collaborative learning environments [31]. In this article, our objective is to collect a multimodal dataset of cognitive and socio-emotional interactions for in-depth SSRL study, denoted as the multimodal dataset of socially shared regulation in learning (MSSRL). Constructing the dataset requires interdisciplinary collaboration between learning science and computer science researchers: learning science researchers design collaborative tasks that elicit cognitive and socio-emotional interactions, while computer science researchers deploy multiple sensors to capture these interactions across modalities. The resulting interdisciplinary dataset can facilitate studies in computer science, learning science, and social science, as shown in Figure 1.
The primary challenge in collecting an SSRL dataset lies in the transient, dynamic, and infrequent nature of cognitive and socio-emotional interactions among group members [2, 58, 83]. Because pertinent data on these interactions are difficult to gather, establishing and evaluating measures of SSRL for the design of situated support is hindered. To address this challenge, we designed a collaborative learning task with regulatory trigger events to induce cognitive and socio-emotional interactions among triads of learners for further SSRL study. In the learning sciences, trigger events are regarded as challenging events and/or situations that may hinder progress in collaboration [17, 33]. The trigger events emulate real-world situations that require an appropriate and strategic response and adaptation in the regulation of cognition, emotion, motivation, and behavior, so findings should generalize to such scenarios. One example is group members struggling to construct shared knowledge upon realizing they have misunderstood the task instructions (cognitive trigger (CT)); this situation requires the group to regulate their cognition to reach a shared understanding of the given task. Another example is a malfunctioning tool combined with a looming deadline (emotional trigger (ET)), which may challenge group members; efficient use of time then requires the regulation of emotions, particularly in navigating the challenge. The concept of triggers offers a useful framework for investigating how regulation occurs in collaborative and individual learning scenarios. It allows researchers and educators to identify specific instances where self-regulation or co-regulation could be activated, adjusted, or maintained, thereby providing valuable insights into the dynamic nature of regulatory processes in learning.
In this study, we gathered data to investigate SSRL using a simulated smoothie bar task, focusing on ensuring safe smoothie preparation for an allergic customer. The CT, “the customer has allergies,” initiates shared understanding among team members, aided by SSRL, to ensure allergen avoidance. In addition, ETs such as “hurry up,” caused by long queues, induce SSRL to manage stress and maintain the balance between speed and safety through effective communication. SSRL also regulates group behavior, coordinating actions to prevent cross-contamination through practices such as cleaning the blender and separating utensils. Our MSSRL dataset stands out from existing datasets by focusing specifically on the intricate mechanisms of SSRL: it is designed to elicit the interactions that are crucial for understanding how groups regulate themselves under CTs and ETs.
Besides inducing cognitive and socio-emotional interactions to gather sufficient data for the SSRL study, the choice of data modality is a crucial factor in establishing a reliable dataset. Human interactions are inherently multimodal, encompassing channels such as verbal communication and nonverbal cues including body language, facial behavior, and physiological signals [63, 77]. Multimodal data are essential because different interactions manifest differently across modalities, offering enhanced flexibility and reliability for human interaction analysis [77]. Prior research has demonstrated the significance of facial expressions, body gestures, and physiological signals in providing important emotional and interaction cues, contributing to problem-solving in cooperative scenarios [21, 24, 45, 62]. Additionally, studying individual actions across different modalities is beneficial for leveraging multiple sources in interaction analysis, necessitating further research on their correlations. To address this, we collect data from several modalities, including video, Kinect data streams, audio, and physiological signals. This comprehensive dataset allows interactions to be analyzed from visual, acoustic, and biological perspectives, as depicted in Figure 2. This multimodal approach provides the opportunity to conduct extensive studies of SSRL, leveraging cues from different sources for a more nuanced understanding of collaborative learning dynamics.
Verbal and nonverbal interaction annotations across modalities are provided with the aim of understanding and studying SSRL. For verbal interaction, annotations of interaction types for regulation, high-level deliberative interaction, and sentence types are introduced. For nonverbal interaction, facial expressions, eye gaze, gestures, and postures are annotated. In summary, we adopt a collaborative learning setting to study SSRL and collect a multimodal dataset named MSSRL. A learning task featuring deliberate interventions is administered to 81 high school students with an average age of 15. Extensive multimodal data, encompassing video, Kinect, audio, and physiological signals and totaling approximately 45.5 hours per modality, are collected and used to investigate SSRL. This dataset offers a rich resource for studying the dynamics of cognitive and socio-emotional interactions in collaborative learning settings.
Numerous multimodal datasets exist for studying interactions in collaborative learning, but only a few systematically explore cognitive and socio-emotional interactions.
To the best of our knowledge, this is the first multimodal dataset tailored for the study of SSRL. Interdisciplinary researchers contribute to the dataset construction, as shown in Figure 1. Learning science researchers design collaborative tasks with a focus on regulation, delving into cognitive and socio-emotional interactions, while computer science researchers contribute their expertise by considering the multifaceted nature of interactions across modalities for SSRL. The synergy of these interdisciplinary efforts results in the successful collection of the dataset. In a reciprocal manner, the dataset can advance studies in fields such as computer science, learning science, and social science. Extensive multimodal analysis has verified the effectiveness of the dataset, and it serves as an invaluable resource for researchers in these domains, providing the means to explore the intricate dynamics of SSRL. Specifically, the annotations of verbal and nonverbal interactions, along with the physiological signals, open doors for learning and social science researchers to uncover the mechanisms of interaction and promote collaborative learning. For computer science researchers, the dataset serves as a playground for developing advanced methods to identify and understand these interactions.
The main contributions are as follows:
—
Considering the difficulty of collecting cognitive and socio-emotional interactions for the SSRL study, this article designs novel triggers to foster such interactions within a collaborative learning task.
—
A multimodal dataset is proposed comprising video, audio, depth, and physiological modalities, providing a holistic view of the collaborative learning process. A comprehensive analysis of multimodalities verifies the dataset’s effectiveness.
—
Detailed annotations are provided for both verbal and nonverbal interactions to enable a deeper analysis of communication patterns, body language, and emotional expressions, which can be valuable for interdisciplinary research, such as learning sciences and computer science.
The rest of the article is structured as follows:
Section 2 covers related work,
Section 3 details dataset collection,
Section 4 addresses dataset annotation,
Section 5 focuses on dataset effectiveness verification, and
Section 6 discusses the contributions, limitations, and future work.
3 Dataset Collection
To systematically and comprehensively study SSRL, we collect a multimodal dataset in a collaborative learning setting that contains facial videos, audio, physiological signals (including electrodermal activity (EDA), HR, and accelerometer), and Kinect data (RGB, depth, silhouette, and skeleton). As far as we know, this is the first multimodal dataset for studying dynamic interactions in collaborative learning with regulatory triggers. It provides an opportunity to comprehensively explore dynamic interactions and regulation, contributing to multiple disciplines including computer science, education, sociology, and psychology. Details of the participants and the data collection procedure are explained in this section.
3.1 Participants
The study involves small groups of three high school students aged 15 years on average (N = 81, male = 45, female = 36) working on a collaborative task. The participants are recruited from high school classes through collaboration with the local teacher training school. In Finland, participants between 15 and 18 years of age can take part in a study without parental consent if the parents are informed about the study. All students are asked to sign the consent form once they understand its contents and agree to participate, and the form includes detailed questions concerning data-sharing issues. In addition, their guardians are informed about the study and receive a General Data Protection Regulation (GDPR) document before data collection. The purpose and procedure of the research are explained to the students before recording starts, and all students know they can withdraw at any time during the collection. Overall, the students are divided into 28 groups: 25 groups with three students and 3 groups with only two students. Since data collection occurred during the COVID-19 period, recruiting participants and setting up the collection posed many challenges; the experiment was therefore designed with the largest feasible sample size for investigating the phenomenon.
3.2 Learning Task and Procedure
During the data collection, participants acted as nutrition specialists working on a collaborative learning task (30–40 minutes) for a smoothie café. Their task was to plan a recipe for customers that would support the immune system during the pandemic.
As this experimental design aims to investigate the effects of specific regulatory triggers, participants are allocated to one of three distinct conditions: (1) Control Group A (9 groups), (2) Treatment Group B (9 groups), and (3) Treatment Group C (10 groups). Control Group A serves as the baseline measure, receiving no intervention, thereby offering a reference point for evaluating the efficacy of the treatments administered to Groups B and C. This group allows researchers to discern any natural fluctuations in the dependent variable, devoid of experimental manipulations. Treatment Group B engages in a collaborative learning task similar to the other groups but is introduced to a single CT halfway through the task. This CT is hypothesized to stimulate problem-solving abilities and enhance group collaboration, as posited by theories of cognitive facilitation in educational settings. In contrast, Treatment Group C, while also receiving the same initial CT halfway through the collaborative task, is subsequently exposed to three ETs, each separated by a 3-minute interval. These ETs aim to elicit specific emotional states or responses that could potentially modulate the group dynamics and the effectiveness of the collaborative learning task. Situated self-reports were administered to all participant groups both before and after the collaborative learning task to capture context-specific metacognitive experiences. The experimental design is shown in
Figure 3.
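As a concrete illustration of this design, the following minimal Python sketch encodes the per-condition trigger schedule. The exact session length (34 minutes here) and the assumption that the first ET follows the CT by 3 minutes are our own illustrative choices, not specifications of the protocol.

# Illustrative sketch of the trigger schedule per condition (not the authors' tooling).
# Assumed: a 34-minute session (within the 30-40 minute range), CT at the midpoint,
# and the first ET 3 minutes after the CT with 3-minute gaps thereafter.
TASK_MINUTES = 34

def trigger_schedule(condition: str, task_minutes: int = TASK_MINUTES):
    """Return (minute, trigger_type) pairs for one experimental condition."""
    halfway = task_minutes // 2
    if condition == "A":      # control: no intervention
        return []
    if condition == "B":      # a single cognitive trigger at the midpoint
        return [(halfway, "CT")]
    if condition == "C":      # CT at the midpoint, then three emotional triggers
        return [(halfway, "CT")] + [(halfway + 3 * (i + 1), "ET") for i in range(3)]
    raise ValueError(f"unknown condition: {condition}")

print(trigger_schedule("C"))  # [(17, 'CT'), (20, 'ET'), (23, 'ET'), (26, 'ET')]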
One researcher stayed in the room with each group the whole time to ensure the smooth running of the experiment but was not involved in the collaborative learning and did not answer any task-related questions. Other researchers controlled the recording devices remotely and monitored the collection process from the next room. Smoothie vouchers were promised to motivate participants to engage in the learning task.
3.3 Equipment Setup and Data Synchronization
Similar to previous studies [10, 19], our data recording is conducted in a laboratory studio. The setups are illustrated in Figure 2(a) and (b). Three participants sit in front of laptops, and a two-meter social distance between participants is maintained throughout the collection procedure owing to health concerns during the pandemic. A \(360^{\circ}\) camera (Insta360 Pro) is used for video recording. It offers subsets of data at both the individual and group levels, which provides a novel and unique opportunity for closely examining interactions. Furthermore, the \(360^{\circ}\) view allows an in-depth qualitative analysis of the interaction contexts, which is essential for studying the interactive process.
The Insta360 Pro contains six camera lenses, and a microphone is placed at its center. The six cameras are hardware synchronized, and the frames grabbed from the six channels are used to reconstruct the entire 360° environment. During the collection, each participant faces one camera directly, giving a compact frontal view of every participant’s face, as shown in Figure 4(a). Figure 4(b) presents the full 360° view synthesized by the Insta360 Pro. The resolutions of the individual videos and the reconstructed video are \(3,840\times 2,160\) and \(1,920\times 960\), respectively, with an average recording rate of 30 fps. In addition, a surveillance camera monitors the studio and provides a full view for later review. A central microphone and three individual microphones record the audio of the whole room and of each subject.
Two Azure Kinect DKs are used to collect RGB and depth videos simultaneously at an average of 30 fps. Because our lab possesses only two Kinect devices and Groups A, B, and C are recorded in parallel, the two Kinects can cover only the three participants of one group. Since our research aims to investigate dynamic interactions involving regulatory triggers, Group C, which includes one CT and three ETs under the full condition, is the primary focus for analyzing the impact of these triggers on participants’ behavior and emotions; the two Kinects are therefore dedicated to Group C. The gesture data of the three participants are estimated with the Azure Kinect Body Tracking Software Development Kit. The two devices, denoted the “master Kinect” and the “slave Kinect,” are synchronized. The master Kinect is set in front of the screen; it records the gestures of two participants as well as the introduction video played on the screen, which can be used for synchronization in experiments. The slave Kinect is set to the left of the master Kinect and captures the gestures of the participant whom the master Kinect cannot cover. Each Kinect aggregates three sensor streams: a depth camera, a color camera, and an inertial measurement unit. The Azure Kinect Viewer can visualize all the streams, as shown in Figure 5(a).
Physiological data, including EDA, HR, and accelerometer, are captured by physiological sensors (Shimmer GSR3+) as shown in
Figure 5(b). All the sensor devices are calibrated and synchronized with each other before each session. Sensors are attached to the participant’s nondominant hand so that the gel electrodes are placed on the palm’s thenar and hypothenar eminences. Real-time signals of students’ physiological activities are transmitted via Bluetooth connections to a monitoring laptop and supervised by a researcher. Before starting the data collection, the monitoring researcher ensures that all the sensors function correctly. All the signals are collected at the sampling rate of 128 Hz, which could be used to reveal new insights into the emotional and cognitive processes.
Although the above multimodal data offer promising capabilities for analysis, the synchronization of multiple modalities collected from different channels is challenging in both methodological and theoretical aspects. To reach the finest granularity synchronization possible, the data synchronization is planned before the official collection in which each data collection device clock is synchronized to record the Unix timestamp. The real-time timestamps are then used for data synchronization. The audio and video metadata are tracked with device-recorded Unix timestamps, while every record of physiological data was also associated with a specific Unix timestamp.
Finally, the Kinect data are synchronized with physiological data by the frame change of the video played during the introduction of the collaborative tasks. Specifically, the Kinect and \(360^{\circ}\) camera captured the task-introduction video played at the beginning of the collaborative task. Synchronization between Kinect data and \(360^{\circ}\) video data is achieved through frame changes in the introduction video. Additionally, the \(360^{\circ}\) video data are already synchronized with physiological data through timestamps. Therefore, by synchronizing the Kinect data with the \(360^{\circ}\) video data, we effectively synchronize the physiological data as well.
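For readers who wish to reproduce this kind of alignment, the sketch below shows one simple way to map 30-fps video frames to the nearest 128-Hz physiological samples via shared Unix timestamps. It is a minimal illustration rather than the pipeline used in this study, and all variable names are hypothetical.

import bisect

def align_frames_to_signal(frame_ts, signal_ts, signal_values):
    """frame_ts and signal_ts are ascending lists of Unix timestamps (seconds);
    returns one signal value per video frame (nearest neighbour in time)."""
    aligned = []
    for t in frame_ts:
        i = bisect.bisect_left(signal_ts, t)
        if i == 0:
            j = 0
        elif i == len(signal_ts):
            j = len(signal_ts) - 1
        else:
            # pick whichever neighbouring sample is closer in time
            j = i if signal_ts[i] - t < t - signal_ts[i - 1] else i - 1
        aligned.append(signal_values[j])
    return aligned

# Synthetic example: 3 seconds of 30-fps frames aligned to 3 seconds of 128-Hz EDA.
frames = [1600000000 + k / 30.0 for k in range(90)]
eda_ts = [1600000000 + k / 128.0 for k in range(384)]
eda = [0.1 * k for k in range(384)]
print(len(align_frames_to_signal(frames, eda_ts, eda)))  # 90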
3.4 Data Statistics and Quality
Due to an unexpected hardware failure, the physiological data of nine participants and the videos of three participants were lost. The data of the remaining 78 participants are complete and are processed for analysis. Around 2,730 minutes of frontal facial videos and audio data are recorded from the 78 participants. Twenty-eight \(360^{\circ}\) videos are obtained by stitching the videos from the six cameras. The participant region covers around \(600\times 750\) pixels, and the facial region, comprising around \(180\times 200\) pixels on average, provides an adequate level of detail for facial analysis.
Around 630 minutes of Kinect data stream (RGB, depth, silhouette, and skeleton) are collected from 30 participants in Group C with CT and ET. In our collaborative learning scenario, where participants are seated, the range of lower body movements is limited. Therefore, the analysis of the upper body becomes a focal point. Specifically, we concentrate on the upper-body region, which typically spans approximately
\(700\times 900\) pixels. Moreover, around 2,040 minutes of physiological data are collected, including HR, EDA, and accelerometer. The data statistics of the multiple modalities are presented in
Table 2. The list of recorded file formats is shown in
Table 3.
3.5 Ethics, Privacy, and Data Availability
Data collection, storage, and management were conducted in compliance with the GDPR [66]. Furthermore, all procedures concerning the dataset adhered to the ethical guidelines established by the Finnish National Board of Research Integrity, the All European Academies’ Code for Research Integrity, and the University of Oulu. Ethical approval was obtained from the Oulu University Ethics Committee (ID 4/21/Sanna Järvelä). Data collection imposes no disadvantages upon the participants. Participation is voluntary, and withdrawal from the study is possible at any time. Separate written consent is required from both students and their guardians, and prior to giving consent, both parties are fully informed about the study’s objectives and data management practices (in accordance with the GDPR). Pseudonymization through nonpersonal identifiers (assigned ID numbers) is applied to all data formats other than the video and audio of the learning session, which must be analyzed with personal identifiers (likeness, voice) in place.
To promote relevant scientific development in the fields of computer science and learning science, the dataset or specific portions thereof may be made accessible to qualified researchers or research teams upon request. Access to the data will be facilitated through direct communication with the authors, who act as the data custodians. It is important to note that the release of data will be subject to the execution of a data transfer agreement, ensuring responsible use and compliance with ethical standards and legal requirements. Additionally, we have published the metadata (available at https://etsin.fairdata.fi/dataset/69a92e8e-e4c6-4531-a2fb-d951fc5eac90), which provides the dataset with a persistent identifier and a landing page and distributes its description to other relevant services.
4 Data Annotation
MSSRL aims to provide data supporting multidisciplinary research for studying interactions in collaborative learning. Various annotation schemas have been applied to both verbal and nonverbal interactions in multimodalities [33, 50]. Verbal interactions are categorized into three levels based on audio data to facilitate the SSRL study. In the case of nonverbal interactions, comprehensive annotations are provided for facial expressions, gazes, gestures, and postures, contributing to the understanding of human communication and behavior. These annotations also serve as vital data for computer science applications, such as developing algorithms for emotion recognition, human–computer interaction, and automated understanding of social interactions, which are essential for future studies.
4.1 Verbal Interaction
For annotating the theory-based meaning of verbal interactions, we adopted the human–AI approach of Järvelä et al. [33], which integrates the unique strengths of both humans and Artificial Intelligence (AI) for micro-qualitative annotation of verbal interactions in SSRL research. The interactions were first recorded and then transcribed in the original Finnish. Transcription, along with segmentation into individual speech turns, was carried out using Microsoft’s Azure Cognitive Services. Following the automated process, a validation phase ensured the reliability and accuracy of the data: two native-Finnish research assistants experienced in transcription independently reviewed and corrected the automatically generated transcriptions and speech segmentation. To quantify the accuracy of the automated transcription, we used the difflib Python library to compare the machine-generated text with the human-corrected version. This comparison, performed on a total of 6,111 utterances from the conditions with both triggers, yielded a similarity score of 81.46%. This level of congruence between the two versions corroborates the efficacy of automated transcription services while underscoring the need for human oversight to capture the subtleties of natural language.
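To illustrate how such a similarity score can be computed, the following snippet uses the same difflib library on two invented utterance pairs; the preprocessing applied in the actual study (e.g., casing or punctuation handling) is not specified and is therefore omitted here.

from difflib import SequenceMatcher

# Illustrative reproduction of the transcription-accuracy check: compare an ASR
# utterance with its human-corrected version and average the similarity ratios
# over all utterances (the example strings below are invented).
def utterance_similarity(asr_text: str, corrected_text: str) -> float:
    return SequenceMatcher(None, asr_text, corrected_text).ratio()

pairs = [
    ("we should add blueberries to the smoothie", "we should add blueberries to this smoothie"),
    ("the customer has an allergy", "the customer has allergies"),
]
mean_similarity = sum(utterance_similarity(a, b) for a, b in pairs) / len(pairs)
print(f"mean similarity: {mean_similarity:.2%}")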
The same approach was applied to the translation process, in which each utterance was first automatically translated using Azure Cognitive Services. This machine-generated translation was then validated and corrected by a human research assistant.
Continuing with the annotation process, the dataset implemented a multilevel, theory-driven qualitative coding scheme to annotate each utterance with a meaningful label. This comprehensive qualitative annotation extended through three hierarchical levels to provide a nuanced understanding of the SSRL interactions. The annotation consists of three layers: (1) macrolevel concepts concerning types of interactions for regulation, (2) microlevel concepts focusing on the deliberative characteristics of the interaction, and (3) types of sentences. At the macrolevel, the focus was on categorizing the types of interactions that were primarily regulatory. The types of interactions were systematically organized and described in
Table 4. This level offered a broad view of how participants engage in different forms of interaction that either facilitate or inhibit effective regulation.
The microlevel of analysis further refined our understanding by focusing on the deliberative characteristics of each interaction [15, 16]. A detailed account of these deliberative characteristics is available in
Table 5. This level of annotation allowed us to isolate and examine the subtle strategies and mechanisms individuals employ during interactions to collectively regulate their learning process.
Lastly, at the base level, each utterance was classified according to the type of sentence used—whether it was a statement, a question, or other sentence types, as shown in
Table 6. This basic classification served as a foundational layer that enabled more advanced layers of coding and interpretation.
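For clarity, the sketch below shows one possible record layout for a single annotated utterance combining the three layers. The field names and example label values are our own illustrations and do not reproduce the exact label sets of Tables 4–6.

from dataclasses import dataclass

# Hypothetical record layout for one annotated utterance; the three label fields
# mirror the coding scheme described above (macrolevel interaction type,
# microlevel deliberative characteristic, sentence type).
@dataclass
class AnnotatedUtterance:
    group_id: str        # e.g. "C07" (invented identifier scheme)
    speaker_id: str      # pseudonymized participant ID
    start_s: float       # utterance onset in session time (seconds)
    end_s: float         # utterance offset
    text_fi: str         # corrected Finnish transcription
    text_en: str         # corrected English translation
    macro_label: str     # type of interaction for regulation (cf. Table 4)
    micro_label: str     # deliberative characteristic (cf. Table 5)
    sentence_type: str   # statement / question / other (cf. Table 6)

u = AnnotatedUtterance("C07", "P2", 812.4, 815.1,
                       "Tarkista ainesosat uudelleen.", "Check the ingredients again.",
                       "task regulation", "proposal", "statement")  # placeholder labels
print(u.macro_label, u.sentence_type)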
Several measures were taken to ensure the reliability of the annotations, which were carried out by two annotators. Firstly, both annotators underwent extensive training on the annotation framework and engaged in calibration sessions to align their understanding and application of the annotation criteria [7]. Secondly, the annotation process included iterative reviews in which both annotators discussed and resolved discrepancies, thereby enhancing the consistency of their annotations. Thirdly, although at least three annotators are commonly employed, we conducted a standard inter-annotator agreement analysis with two annotators to assess the reliability of the annotations. This analysis was performed on a 20% sample of our dataset and provided a quantitative measure of consistency between annotators; the results fell within acceptable ranges for qualitative research, indicating a high level of agreement. Such a method aligns with established practices in several research fields, particularly in learning sciences research, as evidenced by its application in highly regarded journal articles, including the study by Järvenoja et al. [35].
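The agreement statistic is not named in this article; as an illustration only, the snippet below computes Cohen’s kappa, a common choice for two annotators, on a small set of invented sentence-type labels.

from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on a small sample of utterances.
annotator_1 = ["statement", "question", "statement", "other", "statement", "question"]
annotator_2 = ["statement", "question", "statement", "statement", "statement", "question"]
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa on the sample: {kappa:.2f}")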
These annotations provide a structured approach for analyzing verbal interactions across various levels of abstraction, employing different theoretical perspectives from the learning sciences at both group and individual levels. They enable researchers to dissect how communication contributes to both SSRL and self-regulation of learning in group settings. Furthermore, the three-level abstraction of these annotations aids in identifying specific group- or individual-level verbal behaviors that facilitate or hinder effective learning, offering valuable insights for developing targeted interventions to improve collaborative learning outcomes. The macrolevel analysis focuses on forms of interaction and the microlevel analysis deepens this by looking into specific interaction characteristics. These annotation levels help to identify especially group- or peer-level interaction processes. The base level focuses on individual utterances and identifies specific sentence types to reveal individual-level processes, such as how individual learners contribute verbally to the group-level interaction processes. In all, this approach enables a comprehensive examination of interaction processes and mechanisms for learning regulation at different granularities.
4.2 Facial Expression
Socio-emotional interactions reflect fluctuations in learners’ participation in terms of emotional expressions, enabling a comprehensive understanding of collaborative learning processes. Recognizing individuals’ emotions through AI-based facial expression recognition is essential in this context, and emotion annotation around the triggers serves several vital research purposes [40, 41, 43]. Firstly, it allows investigation of how CTs and ETs impact collaboration, revealing their effectiveness in shaping interactions. Secondly, emotion annotation assists in assessing AI’s capability to accurately identify facial expressions during interactions, a task that is both time-consuming and labor-intensive for humans. The data 30 seconds before and after every trigger are annotated with three emotion categories: negative, positive, and neutral. The annotation process is conducted in three steps. First, we extract the frames around the CTs and ETs and roughly crop the facial regions to make the facial expressions easy to follow. Second, 10 annotators work independently after a preparatory course; each annotator is required to annotate three trigger clips. Labels are assigned to every trigger clip per second rather than per frame because emotions change in an evolutionary manner, and a tool is developed to play the frames continuously, second by second, for annotation. Finally, each video clip is annotated independently by three of the 10 annotators, and the final label is decided by the emotion category receiving the most votes.
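The per-second majority vote can be expressed compactly as in the sketch below; tie-breaking behavior is not specified in the study, so the example resolves ties arbitrarily.

from collections import Counter

# Sketch of the per-second majority vote: each one-second segment of a trigger
# clip receives the label chosen by the most annotators.
def majority_vote(labels_per_annotator):
    """labels_per_annotator: list of per-second label lists, one per annotator."""
    n_seconds = len(labels_per_annotator[0])
    final = []
    for s in range(n_seconds):
        votes = Counter(ann[s] for ann in labels_per_annotator)
        final.append(votes.most_common(1)[0][0])
    return final

clip_labels = [
    ["neutral", "neutral", "positive", "positive"],   # annotator 1
    ["neutral", "positive", "positive", "neutral"],   # annotator 2
    ["neutral", "neutral", "positive", "positive"],   # annotator 3
]
print(majority_vote(clip_labels))  # ['neutral', 'neutral', 'positive', 'positive']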
The distribution of emotions among participants is as follows: positive emotions: \(16.01\%\), negative emotions: \(4.06\%\), neutral emotions: \(79.93\%\). This distribution indicates that the majority of participants’ emotional expressions fall within the “neutral” category, with smaller percentages expressing “positive” and “negative” emotions. This information provides valuable insights into the emotional dynamics of collaborative learning, indicating that the learning environment may generally be characterized by a sense of neutrality and composure among participants.
4.3 Eye Gaze
In this multimodal interaction dataset, the annotation of eye gaze plays a critical role in analyzing nonverbal elements of communicative processes. Eye gaze annotation is invaluable for understanding communication dynamics, cognitive processes, and learning engagement [22]. It captures subtle cues, such as gaze shifts and patterns, revealing attentional focus, engagement levels, and comprehension strategies. In our collaborative learning settings, leveraging eye gaze enhances SSRL and group dynamics. Monitoring engagement through gaze patterns allows educators to identify and intervene with disengaged learners, ensuring active participation. Eye gaze cues facilitate smooth turn-taking, fostering equitable involvement and a supportive atmosphere. Encouraging learners to maintain eye contact promotes active listening, enhancing communication and understanding. Gaze signals also support social regulation by conveying interest, agreement, or disagreement, aiding collaborative problem-solving. Integrating gaze feedback provides real-time insights into group dynamics and fosters reflection. Moreover, eye gaze data aid peer assessment, offering objective indicators of participation and engagement for accountability. Overall, utilizing eye gaze in collaborative learning enhances SSRL by promoting engagement, facilitating communication, and supporting social and cognitive regulation processes among learners.
This study used a theory-driven manual annotation approach for eye gaze. Experienced coders annotated gaze instances to identify focus areas and durations, guided by pre-established theoretical frameworks that consider the importance of gaze in signaling attention, cognitive load, or emotional state. The ELAN software was used for segmenting and coding eye gaze in the multimodal dataset. Inter-coder reliability checks were applied, with multiple coders annotating the same data independently to ensure consistency. The coded gaze types include partner-oriented gaze, object-oriented gaze, iconic gaze, and others, as shown in
Table 7.
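As a practical note for users of the dataset, gaze annotations produced in ELAN are stored as .eaf XML files; the following minimal sketch shows how such a file could be read with standard Python tooling. The tier name and file name are hypothetical, and the snippet assumes time-aligned annotations.

import xml.etree.ElementTree as ET

# Minimal sketch for reading gaze labels from an ELAN .eaf file (standard EAF XML).
# The tier name "gaze_P1" and the file name are hypothetical examples.
def read_gaze_annotations(eaf_path, tier_id="gaze_P1"):
    root = ET.parse(eaf_path).getroot()
    # Map time-slot IDs to milliseconds (assumes slots carry TIME_VALUE attributes).
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.find("TIME_ORDER")
             if ts.get("TIME_VALUE") is not None}
    records = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            label = ann.findtext("ANNOTATION_VALUE", default="")
            records.append((start, end, label))  # e.g. (12000, 14300, "partner-oriented gaze")
    return records

# gaze = read_gaze_annotations("group_C07_gaze.eaf")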
These manually annotated eye gaze metrics are integrated into the overall analysis of interactions, providing a more complete view of how individuals interact. By combining this information with verbal and other nonverbal cues, the dataset offers a richer analysis of interactive behavior, making it a valuable resource for studying socially shared regulation in collaborative learning settings.
4.4 Gesture and Posture
Annotating gestures and postures in a collaborative learning setting is a valuable methodology for understanding the nuances of nonverbal communication, engagement, and the learning experience among participants [26, 67]. Gestures refer to body movements or actions used to convey a message, express an emotion, or emphasize a point in communication. Four types of gestures are annotated based on McNeill’s classification [50]: deictics, beats, iconics, and metaphorics, as shown in Table 8. These categories are valuable for studying and analyzing interactions in a collaborative learning setting, where gestures play a role in enhancing understanding and engagement.
Posture is defined as a position of the body or of body parts [49]. Annotations of postural orientation and of slumped/upright position are provided to support the understanding of nonverbal communication and human behavior. Specifically, postural orientation indicates the direction of one’s body or body parts, such as facing to the right or left, while slumped/upright annotations capture the alignment of the body along the vertical axis, from a slouched position to an upright one. These annotations provide valuable data for studying body language and its implications in various contexts, from psychology and sociology to human-computer interaction.
6 Discussion
In this section, we discuss our findings and contributions and then describe the limitations and future work.
6.1 Contributions to Research on Multimodal Interactions for SSRL
6.1.1 Cognitive and Socio-Emotional Interaction.
Examining cognitive and socio-emotional interactions in the context of SSRL and group regulation in collaborative learning is crucial both for advancing learning sciences theories and for designing effective support for learners in collaborative settings. However, existing interaction datasets are insufficient for developing methods that examine the underlying mechanisms of cognitive and socio-emotional interactions within the context of SSRL. Accordingly, based on the novel concept of trigger events for regulation [33], this study provides a multimodal dataset with designed triggers that regulate emotional and cognitive processes during interactions. This dataset has significant implications for further methodological development and theoretical advancement in researching and understanding the dynamic interaction mechanisms that govern SSRL in collaborative learning.
Our dataset enables the investigation of cognitive and socio-emotional interactions through the lens of strategically designed regulatory triggers, which can inform the development of targeted support mechanisms for learners confronted with challenges in collaborative settings. The analysis of cognitive and socio-emotional interactions in relation to regulatory triggers has several far-reaching implications for both theory and practice. First, the identification of cognitive and socio-emotional interactions associated with such triggers can serve as a diagnostic tool for educators and facilitators to predict points of difficulty within collaborative activities. This predictive utility can then be operationalized through educational technology, such as intelligent tutoring systems, to offer timely interventions that guide groups through cognitively or emotionally challenging scenarios [1]. Second, understanding these triggers can enhance the design of collaborative platforms [33]. Systems can be engineered to provide dynamic scaffolding, tailoring assistance to the type of trigger encountered, whether cognitive or socio-emotional; this leads to more effective support for both self-regulated learning and SSRL [2]. Third, from a pedagogical standpoint, curricula can be designed to include explicit training on recognizing and responding to these triggers, thereby equipping learners with the metacognitive and emotional regulation skills necessary for effective collaboration. Lastly, for researchers, focusing on interactions associated with these triggers offers a refined unit of analysis for investigating the complex interplay between cognitive and socio-emotional processes in collaborative settings.
6.1.2 Multimodalities.
Our dataset provides multimodal data for collaborative learning, including facial videos, audio, physiological signals (including EDA, HR, and accelerometer), and Kinect data (RGB, depth, silhouette, and skeleton). Multimodal analysis is essential for studying human interaction as it offers a comprehensive perspective on communication. It enriches data representation by considering various channels such as speech, gestures, and facial expressions, enabling a deeper comprehension of interactions. Moreover, it facilitates contextual understanding by allowing different modalities to complement each other. For instance, nonverbal cues such as facial expressions can provide insight into the emotional tone of spoken words, while gestures can elucidate the meaning of written text.
Multimodal analysis is highly beneficial in collaborative learning settings, revolutionizing the educational experience [57, 85]. One prominent application is engagement monitoring in online education [8, 28], where digital platforms harness a blend of video feeds and interaction data to gauge student engagement levels. Through the analysis of facial expressions, eye gaze, mouse clicks, and keystrokes, these platforms can discern the moments when students are fully engaged or when their attention wanes. This real-time insight empowers educators to adapt learning content dynamically, offer additional support, or suggest timely breaks to reengage students effectively. Furthermore, peer assessment and feedback in group projects stand to gain significantly from multimodal analysis [23]. Instructors and AI systems can comprehensively evaluate collaboration quality during virtual group meetings by analyzing audio recordings, text chats, and screen sharing. This holistic assessment encompasses factors such as the distribution of speaking time, the quality of discussions, and individual contributions, resulting in more accurate peer evaluations and fairer grading processes. Additionally, multimodal analysis demonstrates its worth in special education, particularly in nurturing social skills development [11]. Educators can provide real-time feedback to students with autism or social challenges by tracking facial expressions, body language, and vocal intonations during social interactions. This approach contributes significantly to enhancing communication skills and fostering the ability to recognize and respond to social cues effectively.
6.2 Interdisciplinary Approach
Another substantial contribution of this study is the interdisciplinary approach, with preliminary results examining the utility of the proposed dataset. Our results show significant differences in emotion and gesture changes across groups exposed to different CTs and ETs. These findings support the view that external events dynamically influence students’ learning interactions [73].
Moreover, we provided annotations on verbal interaction, facial expression, gaze, gesture, and posture. These annotations can serve as training data for machine learning models, particularly those related to computer vision and multimodal data analysis. Researchers can use this resource to create and fine-tune AI algorithms for tasks such as emotion recognition, human–computer interaction, and more [42, 47, 86]. The dataset encompasses challenging real-world situations. Challenging scenarios are valuable for testing the robustness and effectiveness of AI algorithms. They can help researchers develop models that perform well in real-world, less-controlled environments.
Our interdisciplinary approach in this presented study also responds to the recent calls for interdisciplinary efforts bridging learning sciences, sociology, machine learning, and computer science to maximize the impact of multimodal data and advanced techniques in examining and supporting emotional and cognitive processes. This article contributes to the field of computer sciences by offering a novel dataset for multimodal model development, the field of learning sciences by providing new insights into the trigger moments for cognitive and emotional processes in collaborative learning, and the field of sociology by studying regulatory trigger influences on interactions to develop interactive intelligent systems.
6.3 Limitation and Future Work
Since this is a preliminary study of multimodal analysis for SSRL, several limitations should be addressed in future work. The first concerns the annotation. The dataset includes annotations for verbal interactions, gazes, gestures, and postures. These annotations are firmly rooted in theoretical frameworks drawn from both computer science and the learning sciences, thereby enhancing the multidisciplinary nature of our research, and we anticipate that they will enable a more detailed and nuanced comprehension of the data. However, it is important to clarify that these annotations underwent a reliability assessment carried out by two independent coders. We were unable to involve a third annotator across the various annotation categories, as is often recommended, and we acknowledge this as a limitation of our study: despite rigorous annotation from the viewpoints of both computer science and the learning sciences and a carefully executed reliability test with two independent coders, a third independent coder could further enhance the reliability of these annotations.
In addition, due to equipment constraints, we recorded Kinect data only for the three participants in each Group C session. Although Kinect data for Groups A and B were not collected, the upper-body gestures of participants in these groups, captured by the \(360^{\circ}\) cameras, can still be analyzed in the future.
Notably, the primary focus of this work is on the data collation, categorization, description, and validation of dataset effectiveness through the analysis of the designed triggers, rather than the evaluation or proposal of computational models to interpret the data. Therefore, the predictive capabilities of the dataset in terms of qualitative annotations have not been explored in the current study. A crucial avenue for future research involves the establishment of baseline predictions for the qualitative annotations within the dataset. Such baselines would serve as empirical benchmarks for comparing and evaluating the performance of subsequent predictive models. This is a nontrivial task, given the complexities inherent in multimodal interactions, and represents an essential next step for fully leveraging the utility of the dataset in both computational and educational contexts.
Another limitation is that this work analyzes only interactions influenced by the triggers. The relationship between emotional and cognitive processes throughout the entire collaboration should be explored further.