
Interactions for Socially Shared Regulation in Collaborative Learning: An Interdisciplinary Multimodal Dataset

Published: 02 August 2024

Abstract

Socially shared regulation plays a pivotal role in the success of collaborative learning. However, evaluating socially shared regulation of learning (SSRL) proves challenging because the cognitive and socio-emotional interactions that constitute its focal point are dynamic and infrequent. To address this challenge, this article brings together interdisciplinary researchers to establish a multimodal dataset of cognitive and socio-emotional interactions for SSRL study. Firstly, to induce cognitive and socio-emotional interactions, learning science researchers designed a dedicated collaborative learning task with regulatory trigger events for triads of learners. Secondly, the dataset includes multiple modalities, such as video, Kinect data, audio, and physiological data (accelerometer, EDA, heart rate), from 81 high school students in 28 groups, offering a comprehensive view of the SSRL process. Thirdly, three-level verbal interaction annotations and nonverbal interaction annotations covering facial expression, eye gaze, gesture, and posture are provided, which can further contribute to interdisciplinary fields such as computer science, sociology, and education. In addition, comprehensive analysis verifies the dataset's effectiveness. To the best of our knowledge, this is the first multimodal dataset for studying SSRL among triadic group members.

1 Introduction

Collaborative learning is a social system in which groups of learners solve problems or construct knowledge by working together [4]. It is a vital skill in today's knowledge-driven society and stands out as a widely embraced educational approach [25]. Understanding and studying interactions in collaborative learning has gained significant interest across diverse disciplines [20, 64, 75] because it enables us to improve teaching methods, learning outcomes, and the overall educational experience for students. Moreover, collaborative learning can be adopted in professional settings such as business environments, corporate training programs, and research endeavors. Therefore, achieving success in collaborative learning holds significant importance.
Recent research underscores the crucial role of socially shared regulation in learning (SSRL) as a determinant for successful collaborative learning initiatives [27]. SSRL involves regulating group behavior and refining shared understanding through negotiation, integral for effective collaborative learning, shaping group dynamics, and enhancing shared knowledge construction [27]. The focal point of SSRL lies in the interplay between cognitive and socio-emotional interactions, recognized as primary mechanisms for facilitating group regulation in collaborative learning environments [31]. In this article, our objective is to collect a multimodal dataset of cognitive and socio-emotional interactions for the in-depth SSRL study, denoted as multimodal dataset of socially shared regulation in learning (MSSRL). The construction of the dataset requires collaborative efforts from interdisciplinary fields including learning science and computer science researchers. Learning science researchers design collaborative tasks to collect cognitive and socio-emotional interactions, while computer science researchers allocate multiple sensors to gather multimodal interactions. This constructed interdisciplinary dataset can facilitate studies in computer science, learning science, and social science, as shown in Figure 1.
Figure 1.
Figure 1. The concept of interdisciplinary collaboration for MSSRL.
The primary challenge in collecting an SSRL dataset lies in the transient, dynamic, and infrequent nature of cognitive and socio-emotional interactions among group members [2, 58, 83]. It is difficult to collect pertinent data on the interactions associated with SSRL, hindering the establishment and evaluation of its measurement for the design of situated support. To address this challenge, we designed a dedicated collaborative learning task with regulatory trigger events to induce cognitive and socio-emotional interactions within triads of learners for further SSRL study. In the learning sciences, trigger events have been regarded as challenging events and/or situations that may hinder progress in collaboration [17, 33]. With the trigger events, we illustrate real-world situations that require appropriate and strategic responses and adaptation in the regulation of cognition, emotion, motivation, and behavior; thus, they should generalize to such scenarios. One example of such a scenario is that group members struggle to construct shared knowledge when realizing a misunderstanding of the task instructions (cognitive trigger (CT)). This situation requires appropriate regulation of cognition from the group members to achieve a shared understanding of the given task. Another example could be a malfunctioning tool with a looming deadline (emotional trigger (ET)), which may challenge group members. Efficient time use requires the regulation of emotions, particularly in navigating challenges. The concept of triggers offers a useful framework for investigating how regulation occurs in collaborative and individual learning scenarios. It allows researchers and educators to identify specific instances where self-regulation or co-regulation could be activated, adjusted, or maintained, thereby providing valuable insights into the dynamic nature of regulatory processes in learning.
In this study, we gathered data to investigate SSRL using a simulated smoothie bar task, focusing on ensuring safe smoothie preparation for an allergic customer. The CT, “the customer has allergies,” initiates shared understanding among team members, aided by SSRL, ensuring allergen avoidance. In addition, ETs such as “hurry up,” caused by long queues, induce SSRL to manage stress and maintain the balance between speed and safety through effective communication. SSRL also regulates group behavior, coordinating actions to prevent cross-contamination through practices like blender cleaning and separating utensils. Our MSSRL dataset stands out from existing datasets by focusing specifically on the intricate mechanisms of SSRL. It is designed to elicit specific interactions that are crucial for understanding how groups regulate themselves under CTs and ETs.
Besides inducing cognitive and socio-emotional interactions to gather sufficient data for the SSRL study, the choice of data modality is a crucial factor in establishing a reliable dataset. Human interactions are inherently multimodal, encompassing various channels such as verbal communication and nonverbal cues including body language, facial behavior, and physiological signals [63, 77]. The utilization of multimodal data is essential as different interactions manifest differently across modalities, offering enhanced flexibility and reliability for human interaction analysis [77]. Prior research has demonstrated the significance of facial expressions, body gestures, and physiological signals in providing important emotional and interaction cues, contributing to problem-solving in cooperation scenarios [21, 24, 45, 62]. Additionally, studying individual actions across different modalities proves beneficial for leveraging multiple sources in interaction analysis, necessitating further research on their correlations. To address this, our approach involves collecting data from various modalities, including video, Kinect data stream, audio, and physiological signals. This comprehensive dataset allows for the analysis of interactions from visual, acoustic, and biological perspectives, as depicted in Figure 2. This multimodal approach provides the opportunity to conduct extensive studies of SSRL, leveraging cues from different sources for a more nuanced understanding of collaborative learning dynamics.
Figure 2.
Figure 2. The data collection setup. (a) An illustration of the seating plan and the location of the devices; (b) the real environment of data collection.
Verbal and nonverbal interaction annotations across modalities have been provided with the aim of understanding and studying SSRL. Specifically, for verbal interaction, annotations of interaction for regulation, high-level deliberative interaction, and sentence types have been introduced. Facial expressions, eye gaze, gestures, and postures have been annotated for nonverbal interaction. In summary, we adopt a collaborative learning setting to study SSRL and collect a multimodal dataset named MSSRL. A learning task featuring deliberate interventions is administered to 81 high school students with an average age of 15. Extensive multimodal data, encompassing video, Kinect, audio, and physiological signals and totaling approximately 45.5 hours per modality, are collected and utilized to investigate SSRL. This dataset offers a rich resource for studying the dynamics of cognitive and socio-emotional interactions in collaborative learning settings.
Currently, there are numerous multimodal datasets for studying interactions in collaborative learning, but only a few studies systematically explore cognitive and socio-emotional interactions. To the best of our knowledge, this is the first multimodal dataset tailored for the study of SSRL. Interdisciplinary researchers contribute to the dataset construction, as shown in Figure 1. Learning science researchers design collaborative tasks with a focus on regulation, delving into cognitive and socio-emotional interactions. Simultaneously, computer science researchers contribute their expertise by considering the multifaceted nature of interactions across modalities for SSRL. The synergy of these interdisciplinary efforts results in the successful collection of the dataset. In a reciprocal manner, this dataset can facilitate advancing studies in various fields, such as computer science, learning science, and social science. Extensive analysis across modalities has verified the effectiveness of this dataset. It serves as an invaluable resource for researchers in these domains, providing the means to explore the intricate dynamics of SSRL. Specifically, the annotations of verbal and nonverbal interactions, along with physiological signals, open doors for learning and social science researchers to uncover the mechanisms of interaction and promote collaborative learning. For computer science researchers, this dataset serves as a playground for developing advanced methods to identify and understand these interactions.
The main contributions are as follows:
Considering the difficulty of collecting cognitive and socio-emotional interactions for the SSRL study, this article designs novel triggers that foster such interactions within a collaborative learning task.
A multimodal dataset is proposed comprising video, audio, depth, and physiological modalities, providing a holistic view of the collaborative learning process. A comprehensive analysis of multimodalities verifies the dataset’s effectiveness.
Detailed annotations are provided for both verbal and nonverbal interactions to enable a deeper analysis of communication patterns, body language, and emotional expressions, which can be valuable for interdisciplinary research, such as learning sciences and computer science.
The rest of the article is structured as follows: Section 2 covers related work, Section 3 details dataset collection, Section 4 addresses dataset annotation, Section 5 focuses on dataset effectiveness verification, and Section 6 discusses the contributions, limitations, and future work.

2 Related Work

2.1 SSRL within Collaborative Learning

Collaborative learning is an educational approach for enhanced learning [30] in which two or more learners work together to solve problems, complete tasks, or learn new concepts. Learners work as a group rather than individually to obtain a complete understanding by exchanging their ideas and processing and synthesizing information instead of rote memorization of texts [68]. Multiple learners engage in cognitive and socio-emotional processes during the task, which dynamically shapes the performance of the individuals and the group [38]. Specifically, cognitive processes refer to learners' striving to achieve a goal through thinking, reasoning, and discussing. In contrast, socio-emotional processes involve emotions or motivation, such as encouragement and positive appraisal [31]. Although cognitive and socio-emotional processes are partly internal [38], they are also shared and shaped by social interactions, referred to as cognitive and socio-emotional interactions. In the learning sciences, there is a significant focus on understanding the role of cognitive and socio-emotional interactions as key mechanisms for promoting group regulation in collaborative learning environments. This concept is often referred to as SSRL. Existing research emphasizes that SSRL plays a crucial role in determining the success of collaborative learning initiatives [27].
Through cognitive and socio-emotional interactions, SSRL operates through a complex interplay of cognitive and socio-emotional processes among group members. On the cognitive front, learners mutually scaffold each other’s understanding and knowledge construction through shared problem-solving, idea articulation, and metacognitive strategy deployment [33]. These cognitive interactions serve to synchronize the group’s goals, ensuring alignment in both learning objectives and methods for achieving those objectives. Concurrently, socio-emotional interactions lay the foundation for establishing a conducive learning climate. These interactions often manifest in the form of emotional support, motivation sharing, and conflict resolution, all of which are critical for maintaining positive group dynamics [18]. Such interactions serve to enhance group cohesion and emotional synchrony, thereby creating a synergistic environment where learners are more inclined to engage in cognitive scaffolding and co-regulatory practices [39, 58]. Analyzing cognitive and socio-emotional interactions through SSRL in collaborative learning allows for a more nuanced understanding of how group learning can be effectively regulated, thereby optimizing the potential for successful learning outcomes.
However, identifying and supporting SSRL within collaborative learning settings has proven to be a complex task [2, 59]. The challenge in identifying SSRL instances can be attributed to the transient and dynamic nature of cognitive and socio-emotional interactions among group members. These interactions are often fluid and interdependent, making it difficult to isolate moments of SSRL within the broader collaborative process [58]. Moreover, the varying levels of individual engagement and differing group dynamics further complicate the identification and analysis of SSRL [82]. Furthermore, the literature also points out that SSRL occurrences are not frequent in collaborative learning [83]. It is difficult to collect relevant data on cognitive and socio-emotional interactions associated with SSRL to establish and further evaluate its measurement to design situated support. To address these challenges, recent research in the learning sciences has introduced the concept of trigger events [33] as challenging events and/or situations that may hinder progress in collaboration. With the trigger events, we illustrate real-world situations that require appropriate and strategic response and adaptation in the regulation of cognition, emotion, motivation, and behavior; thus, they should generalize to such scenarios. Inspired by the previous research on triggers [33], we proposed to design CT and ET to induce cognitive and socio-emotional interactions for the SSRL study, respectively.

2.2 Relevant Datasets

Currently, numerous works dedicated to the study and enhancement of collaborative learning have been proposed, accompanied by the development of multiple datasets tailored to facilitate research within this domain [3, 5]. However, the majority of current datasets focus on fundamental interactions, such as gaze, gestures, and emotional cues [10, 52]. Additionally, several datasets have sought to explore leadership styles and study higher-level interactions, such as dominance, rapport, and competence [5, 6, 56]. Cognitive and socio-emotional interactions, the primary mechanisms for facilitating group regulation in collaborative learning, have not been explicitly studied in previous datasets. This article constructs the MSSRL dataset for studying these cognitive and socio-emotional interactions. In this subsection, we review previous interaction datasets based on their objectives and highlight the unique strengths of MSSRL.

2.2.1 Natural Interaction.

The University of Texas at Austin-Interaction dataset [69, 70] captures six classes of human–human interactions, including shaking hands, pointing, hugging, pushing, kicking, and punching. Balazia et al. introduce the bodily behavior dataset in social interaction [3], extending interactions to four people. These datasets mainly rely on visual data, addressing basic social interactions.
Recent research demonstrates that multimodal analysis is crucial for comprehensively studying human behavior [36, 74]. Palmero et al. proposed a multimodal dataset of face-to-face dyadic interactions, named UDIVA, recorded by multiple audiovisual and physiological sensors [61]. In the UDIVA database, audio, video, and heart rate (HR) are considered for the dyadic interaction study. On the other hand, Jansen et al. presented a multimodal database of laughter during interaction to study the expressive patterns of conversational and humor-related laughter based on audiovisual, body movement, electrocardiogram (ECG), and galvanic skin response (GSR) data [32]. Moreover, Carletta et al. collected a pilot Augmented Multi-party Interaction (AMI) Meeting corpus [10] established in a meeting environment with multisubject settings, which can be used to study interactions in realistic meeting situations.

2.2.2 Leadership.

Multiple studies have examined leadership styles in small groups, considering factors such as dominance and rapport [5, 6, 55, 71]. Analyzing leadership and leadership styles, a crucial aspect of effective teamwork, is a prominent research area in social and organizational psychology. Beyan et al. [6] proposed to detect emergent leaders in meeting environments based on nonverbal visual features. Additionally, they delved into predicting the leadership style of emergent leaders using both audio and visual nonverbal features [5]. The AMIGOS dataset [52] contributes to multimodal research on affect, personality traits, dominance, and mood in both individuals and groups, featuring wearable sensors for electroencephalogram (EEG), ECG, and GSR data collection, along with frontal high-definition (HD) video and RGB and depth full-body videos recorded during the observation of short and long videos.
The above-mentioned databases often concentrate on observing human interactions without external interventions, which may not fully replicate real-life situations. Cafaro et al. [9] first designed the NoXi dataset to capture spontaneous mediated novice–expert interactions, with a particular emphasis on adaptive behaviors and unexpected situations such as conversational interruptions, someone calling in, or walking into the interaction. To comprehensively capture the interactions, NoXi utilizes Kinect and headsets to collect data on full-body movements, facial expressions, gestures, speech, video, depth, and skeletal information.
Different from NoXi, which investigates spontaneous mediated novice–expert interactions with interventions, this article aims to elicit and capture cognitive and socio-emotional interactions within groups for SSRL study. It is achieved by introducing CT and ET through the design of interventions. To comprehensively understand the SSRL, video, Kinect data stream, audio, and physiological data are collected. It provides an interdisciplinary dataset for studying SSRL. Specifically, task design, data collection, and dataset verification are all co-designed and conducted with interdisciplinary perspectives, including computer science, learning science, and psychology. Apart from investigating SSRL, the proposed dataset additionally provides materials for the research of emotion recognition, physiological signal analysis, and multimodal fusion in the future. The specific comparisons with other similar datasets are shown in Table 1. In summary, the MSSRL dataset contributes significantly to educational research by offering novel triggers designed to foster cognitive and socio-emotional interactions in collaborative learning contexts, which cannot simply be achieved by directly providing annotations on existing datasets. It focuses specifically on SSRL, providing comprehensive data collection across multiple modalities. This tailored approach allows for nuanced analysis of group dynamics, individual behaviors, and learning interactions under SSRL that are often overlooked in existing datasets. By filling a gap in research and facilitating the study of SSRL, the MSSRL dataset aids in the development of pedagogical tools and strategies for enhancing collaborative learning outcomes.
Table 1.
Dataset | Purpose | NoP | Par | Size | Modalities | Annotation
UT-Interaction [69, 70] | Recognizing human–human interactions | 2 | 40 | 20 minutes | V | Human interaction
BBSI [3] | Bodily behaviors in social interaction | 3/4 | 78 | 26 hours | A/V | Body language
Leadership style prediction [5] | Leadership style of an emergent leader | 4 | 64 | 393 minutes | A/V | Democratic, Autocratic, and Not-a-leader
ELEA AVS corpus [71] | Emergent leadership in small groups | 3/4 | 85 | 120 minutes | A/V | Performance in tasks, Leadership, Dominance
MPIIGroup Interaction [55] | Detecting low rapport during natural interactions | 4 | 78 | 440 minutes | A/V | Rapport, Leadership, Dominance, Competence, Liking, Personality
PAVIS dataset [6] | Identifying emergent leader | 3/4 | 64 | 393 minutes | A/V | Leader behavior ranking
AMI meeting corpus [10] | Uses of a consortium developing meeting browsing technology | 4 | - | 100 hours | A/V | Abstract, Decision, Individual actions, Attention, Movement, Emotion, Topic
NoXi [9] | Natural interactions in a knowledge-sharing context | 2 | 87 | 25 hours 18 minutes | A/V/D | Head movement and direction, Eyebrow movements, Gaze direction, Smile, Gesture, Engagement, Audio, Hand position
MULAI [32] | Expressive patterns of laughter | 2 | 32 | 357 minutes | A/V/BM/ECG/GSR | Laughter-related events
UDIVA [61] | Context-aware personality inference | 2 | 147 | 90.5 hours | A/V/HR | Personality scores, Sociodemographics, Mood, Fatigue, Relationship type
AMIGOS [52] | Affect, personality, mood, and social context recognition | 1 and 4 | 40 | - | A/V/D/EEG/GSR/ECG | Emotion, Valence, Arousal, Dominance, Liking
MSSRL (ours) | Collaborative learning with regulations | 3 | 81 | 45.5 hours | A/V/D/ACC/EDA/HR | Interaction for regulation, Deliberative interaction, Sentence type, Facial expression, Gaze, Gesture, Posture
Table 1. Dataset Comparison
A, audio; ACC, accelerometer; BM, body movement; D, depth; ECG, electrocardiogram; EDA, electrodermal activity; EEG, electroencephalogram; GSR, galvanic skin response; HR, heart rate; NoP, number of participants per interaction; Par, participants in total; V, video.

3 Dataset Collection

To systematically and comprehensively study SSRL, we collect a multimodal dataset under a collaborative learning setting that contains facial videos, audio, physiological signals (including electrodermal activity (EDA), HR, and accelerometer), and Kinect data (RGB, depth, silhouette, and skeleton). As far as we know, this is the first multimodal dataset for studying dynamic interactions in collaborative learning with regulatory triggers. It provides an opportunity to comprehensively explore dynamic interactions and regulation, and it can contribute to multiple disciplines, including computer science, education, sociology, and psychology. Details of the participants and the data collection procedure are explained in this section.

3.1 Participants

The study involves small groups of three high school students aged 15 years on average (N = 81, male = 45, female = 36) who work on a collaborative task. The participants are recruited from high school classes through collaboration with the local teacher training school. In Finland, a participant aged between 15 and 18 years can take part in a study without parental consent if the parents are informed about the study. All students are asked to sign the consent form once they understand its contents and agree to participate in the study. The consent form includes detailed questions concerning data-sharing issues. In addition, their guardians are informed about the study and receive a General Data Protection Regulation (GDPR) document before data collection. The purpose and procedure of this research are explained to the students before the recording starts. All students know they can withdraw at any time during the collection. Overall, the students are divided into 28 groups: 25 groups with three students and 3 groups with two students. Since the data collection occurred during the COVID-19 period, recruiting participants and setting up the data collection posed many challenges; thus, the sample size was maximized within these constraints for investigating the phenomenon.

3.2 Learning Task and Procedure

During the data collection, participants act as nutrition specialists working on a collaborative learning task (30–40 minutes) for a smoothie café. Their task is to plan a recipe for customers that supports the immune system during the pandemic.
As this experimental design aims to investigate the effects of specific regulatory triggers, participants are allocated to one of three distinct conditions: (1) Control Group A (9 groups), (2) Treatment Group B (9 groups), and (3) Treatment Group C (10 groups). Control Group A serves as the baseline measure, receiving no intervention, thereby offering a reference point for evaluating the efficacy of the treatments administered to Groups B and C. This group allows researchers to discern any natural fluctuations in the dependent variable, devoid of experimental manipulations. Treatment Group B engages in a collaborative learning task similar to the other groups but is introduced to a singular CT halfway through the task. This CT is hypothesized to stimulate problem-solving abilities and enhance group collaboration, as posited by theories of cognitive facilitation in educational settings. In contrast, Treatment Group C, while also receiving the same initial CT halfway through the collaborative task, is subsequently exposed to three ET, each separated by 3-minute intervals. These ET aim to elicit specific emotional states or responses that could potentially modulate the group dynamics and the effectiveness of the collaborative learning task. Situated self-reports were administered to all participant groups both before and after the collaborative learning task to capture context-specific metacognitive experiences. The experimental design is shown in Figure 3.
Figure 3.
Figure 3. An illustration of the collaborative learning task. Group members work together to prepare a healthy smoothie for a customer. When the CT applies, the customer says she has an allergy to latex protein and dairy products. Afterward, the group will be presented with several ET at a specific time interval. Zoom in for a better view.
One researcher stayed in the room with each group the whole time to ensure the smooth running of the experiment but did not take part in the collaborative learning or answer any task-related questions. Other researchers controlled the recording devices remotely and monitored the collection process from an adjacent room. Smoothie vouchers were promised to motivate participants to engage in the learning task.

3.3 Equipment Setup and Data Synchronization

Similar to previous studies [10, 19], our data recording is held in a laboratory studio. The setups are illustrated in Figure 2(a) and (b), respectively. Specifically, three participants sit in front of laptops. A two-meter social distance between participants is maintained throughout the collection procedure as a COVID-19 health precaution. A \(360^{\circ}\) camera (Insta360 Pro) is utilized for video recording. It offers subsets of data at both individual and group levels, which provides a novel and unique opportunity for closely examining interactions. Furthermore, the \(360^{\circ}\) view allows an in-depth qualitative analysis of the interaction contexts, which is essential for studying the interactive process.
The Insta360 Pro contains six camera lenses, with a microphone placed in the center. The six cameras are hardware synchronized, and the frames grabbed from the six channels are used to reconstruct the whole environment in \(360^{\circ}\). During the collection, each participant faces one camera directly. In this way, we obtain a compact frontal face view for every participant, as shown in Figure 4(a). Figure 4(b) presents the full \(360^{\circ}\) view synthesized by the Insta360 Pro. The resolutions of the individual videos and the reconstructed video are \(3,840\times 2,160\) and \(1,920\times 960\), respectively, with an average recording rate of 30 fps. In addition, a surveillance camera provides a full view of the studio for monitoring and later review. A central microphone and three individual microphones are employed to record the audio of the whole room and of each subject.
Figure 4.
Figure 4. Example view of six cameras in \(360^{\circ}\) camera and the synthesized whole \(360^{\circ}\) view. Zoom in for a better view.
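As a minimal illustration of how the video recordings can be accessed, the sketch below reads one individual frontal-view file with OpenCV and checks its frame rate and resolution; the file name is a placeholder that follows the naming pattern listed later in Table 3, not an actual file in the dataset.

```python
import cv2

# Placeholder file name following the pattern in Table 3.
cap = cv2.VideoCapture("Video_Day1_Group01_View2.avi")
fps = cap.get(cv2.CAP_PROP_FPS)                    # about 30 on average
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))     # 3,840 for the individual views
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))   # 2,160 for the individual views

ok, frame = cap.read()   # one BGR frame; facial regions can be cropped downstream
cap.release()
print(fps, width, height, ok)
```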
Two Azure Kinect DKs are utilized to collect RGB and depth videos simultaneously at an average of 30 fps. Because our lab possesses only two Kinect devices and Groups A, B, and C are collected in parallel, the two Kinects can record the gestures of the three participants in only one group. Our research aims to investigate dynamic interactions involving regulatory triggers, and Group C, which receives the full condition of one CT and three ETs, is the primary focus for analyzing and discussing the impact of these triggers on participants' behavior and emotions. Therefore, the two Kinects are dedicated to Group C. The gesture data of the three participants are estimated with the Azure Kinect Body Tracking Software Development Kit. The two devices, denoted as "master Kinect" and "slave Kinect," are synchronized. The master Kinect is set in front of the screen; it records the gestures of two participants as well as the introduction video played on the screen, which can be utilized for synchronization in experiments. The slave Kinect is set on the left side of the master Kinect and captures the gestures that the master Kinect cannot. Three sensor streams are aggregated in each Kinect device: a depth camera, a color camera, and an inertial measurement unit. The Azure Kinect Viewer can visualize all the streams, as shown in Figure 5(a).
Figure 5.
Figure 5. Example of Kinect and physiological data with recording software. Zoom in for a better view.
Physiological data, including EDA, HR, and accelerometer, are captured by physiological sensors (Shimmer GSR3+) as shown in Figure 5(b). All the sensor devices are calibrated and synchronized with each other before each session. Sensors are attached to the participant’s nondominant hand so that the gel electrodes are placed on the palm’s thenar and hypothenar eminences. Real-time signals of students’ physiological activities are transmitted via Bluetooth connections to a monitoring laptop and supervised by a researcher. Before starting the data collection, the monitoring researcher ensures that all the sensors function correctly. All the signals are collected at the sampling rate of 128 Hz, which could be used to reveal new insights into the emotional and cognitive processes.
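As a minimal sketch of how one participant's Shimmer recording could be loaded, the snippet below assumes an .xlsx export with Unix timestamps; the file name and the column names are assumptions for illustration and may differ from the actual files.

```python
import pandas as pd

def load_physiology(path: str) -> pd.DataFrame:
    """Load one participant's Shimmer GSR3+ recording (sampled at 128 Hz)."""
    df = pd.read_excel(path)
    df.columns = [c.strip().lower() for c in df.columns]
    # Assumed columns: unix_timestamp (seconds), eda, heart_rate, acc_x, acc_y, acc_z.
    df["time"] = pd.to_datetime(df["unix_timestamp"], unit="s")
    return df.set_index("time")

# Example: downsample EDA to 4 Hz for exploratory plotting (placeholder file name).
# phys = load_physiology("Physiology_Day1_Group01_Individual1.xlsx")
# eda_4hz = phys["eda"].resample("250ms").mean()
```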
Although the above multimodal data offer promising capabilities for analysis, synchronizing multiple modalities collected from different channels is challenging in both methodological and theoretical respects. To achieve the finest possible synchronization granularity, data synchronization is planned before the official collection: the clock of each data collection device is synchronized so that it records Unix timestamps. The real-time timestamps are then used for data synchronization. The audio and video metadata are tracked with device-recorded Unix timestamps, and every record of physiological data is also associated with a specific Unix timestamp.
Finally, the Kinect data are synchronized with physiological data by the frame change of the video played during the introduction of the collaborative tasks. Specifically, the Kinect and \(360^{\circ}\) camera captured the task-introduction video played at the beginning of the collaborative task. Synchronization between Kinect data and \(360^{\circ}\) video data is achieved through frame changes in the introduction video. Additionally, the \(360^{\circ}\) video data are already synchronized with physiological data through timestamps. Therefore, by synchronizing the Kinect data with the \(360^{\circ}\) video data, we effectively synchronize the physiological data as well.
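As a simplified illustration of the timestamp-based alignment described above, the sketch below maps physiological samples to video frame indices via their shared Unix time base; the variable names are illustrative, and the frame-change synchronization with the introduction video is assumed to have been resolved separately.

```python
import numpy as np

def align_to_video(sample_times: np.ndarray, video_start: float, fps: float = 30.0) -> np.ndarray:
    """Map physiological sample times (Unix seconds) to the nearest video frame index.

    sample_times: Unix timestamps of the 128 Hz physiological samples.
    video_start:  Unix timestamp of the first video frame.
    fps:          average video frame rate (about 30 fps in this dataset).
    """
    return np.round((sample_times - video_start) * fps).astype(int)

# Samples recorded before the video started map to negative indices and can be dropped:
# idx = align_to_video(phys_times, video_start)
# idx = idx[idx >= 0]
```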

3.4 Data Statistics and Quality

Due to an unexpected hardware failure, the physiological data of nine participants and the videos of three participants are lost. The data of the remaining 78 participants are complete and are processed for analysis. Around 2,730 minutes of frontal facial videos and audio data are recorded from the 78 participants. Twenty-eight \(360^{\circ}\) videos are obtained by stitching the videos from the six cameras. The participant region covers around \(600\times 750\) pixels, and the facial region, comprising around \(180\times 200\) pixels on average, provides an adequate level of detail for facial analysis.
Around 630 minutes of Kinect data stream (RGB, depth, silhouette, and skeleton) are collected from 30 participants in Group C with the CT and ETs. In our collaborative learning scenario, where participants are seated, the range of lower-body movements is limited. Therefore, the analysis of the upper body becomes a focal point. Specifically, we concentrate on the upper-body region, which typically spans approximately \(700\times 900\) pixels. Moreover, around 2,040 minutes of physiological data are collected, including HR, EDA, and accelerometer. The data statistics of the multiple modalities are presented in Table 2. The list of recorded file formats is shown in Table 3.
Table 2.
Modality | Group | Sample | Length/minute
Video: Frontal facial video | 28 | 78 | 2,730
Video: \(360^{\circ}\) video | 28 | 28 | 980
Kinect: RGB, Depth, Silhouette, Skeleton | 10 | 18 | 630
Audio | 28 | 78 | 2,730
Physiological signal: HR, EDA, Accelerometer | 23 | 66 | 2,310
Table 2. Data Statistics of Multiple Modalities
Table 3.
Sensor | File name | Signal
\(360^{\circ}\) Camera | Video_Day_Group_View_.avi | Video of individual
\(360^{\circ}\) Camera | Video_Day_Group_.avi | \(360^{\circ}\) video
Kinect | Master_Day_Group_.mkv | Skeleton data
Kinect | Sub_Day_Group_.mkv | Skeleton data
Shimmer | Physiology_Day_Group_Individual_.xlsx | EDA, accelerometer, HR
Microphone | Day_Group_Individual_.wav | Audio
Table 3. List of Recorded Files

3.5 Ethics, Privacy, and Data Availability

Data collection, storage, and management were conducted in compliance with the GDPR [66]. Furthermore, all procedures related to the dataset adhered to the ethical guidelines established by the Finnish National Board of Research Integrity, the All European Academies' Code for Research Integrity, and the University of Oulu. Ethical approval was obtained from the Oulu University Ethics Committee (ID 4/21/Sanna Järvelä). Data collection imposed no disadvantages upon the participants. Participation was voluntary, allowing for withdrawal from the study at any time. Separate written ethical consent was mandated from both students and their guardians. Prior to giving consent, both parties were fully informed about the study's objectives and data management practices (in accordance with the GDPR). Pseudonymization through nonpersonal identifiers (created ID numbers) is completed for all data formats other than the video and audio of the learning session, which must be analyzed with personal identifiers (likeness, voice) in place.
To promote relevant scientific development in the fields of computer science and learning science, the dataset or specific portions thereof may be made accessible to qualified researchers or research teams upon request. Access to the data will be facilitated through direct communication with the authors, who act as the data custodians. It is important to note that the release of data will be subject to the execution of a data transfer agreement, ensuring responsible use and compliance with ethical standards and legal requirements. Additionally, we published the metadata (available at https://etsin.fairdata.fi/dataset/69a92e8e-e4c6-4531-a2fb-d951fc5eac90), which provides the dataset with a persistent identifier and a landing page and distributes its description to other relevant services.

4 Data Annotation

MSSRL aims to provide data supporting multidisciplinary research for studying interactions in collaborative learning. Various annotation schemas have been applied to both verbal and nonverbal interactions in multimodalities [33, 50]. Verbal interactions are categorized into three levels based on audio data to facilitate the SSRL study. In the case of nonverbal interactions, comprehensive annotations are provided for facial expressions, gazes, gestures, and postures, contributing to the understanding of human communication and behavior. These annotations also serve as vital data for computer science applications, such as developing algorithms for emotion recognition, human–computer interaction, and automated understanding of social interactions, which are essential for future studies.

4.1 Verbal Interaction

For annotating the theory-based meaning of verbal interactions, we adopted the human and Artificial Intelligence (AI) approach of Järvelä et al. [33], which integrates the unique strengths of both humans and AI for micro-qualitative annotation of verbal interactions to study SSRL. The interactions were initially recorded and then transcribed in the original Finnish. This transcription, along with the segmentation into individual speech turns, was carried out using Microsoft's Azure Cognitive Services. Following the automated process, a validation phase was undertaken to ensure the reliability and accuracy of the data. Two human research assistants, native Finnish speakers experienced in transcription techniques, independently reviewed and corrected the automatically generated transcriptions and speech segmentation. To provide a quantitative measure of the accuracy of the automated transcription, the study utilized the difflib Python library to compare the machine-generated text with the human-corrected version. This algorithmic comparison was performed on a total of 6,111 utterances from the conditions with both triggers and yielded a similarity score of 81.46%. This high level of congruence between the two versions not only corroborates the efficacy of automated transcription services but also accentuates the need for human oversight to capture the subtleties of natural language.
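The similarity check can be reproduced with Python's difflib; the exact ratio definition and corpus-level aggregation used in the study are not specified here, so SequenceMatcher.ratio() and a simple mean over utterance pairs are assumptions in the sketch below.

```python
import difflib

def utterance_similarity(asr_text: str, corrected_text: str) -> float:
    """Character-level similarity between the ASR output and the human-corrected transcript."""
    return difflib.SequenceMatcher(None, asr_text, corrected_text).ratio()

def corpus_similarity(asr_utterances, corrected_utterances) -> float:
    """Mean similarity over aligned utterance pairs (one possible corpus-level aggregation)."""
    scores = [utterance_similarity(a, c) for a, c in zip(asr_utterances, corrected_utterances)]
    return sum(scores) / len(scores)
```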
The same approach was applied to the translation process, in which each utterance was first automatically translated using Azure Cognitive Services. This machine-generated translation was then validated and corrected by a human research assistant.
Continuing with the annotation process, we implemented a multilevel, theory-driven qualitative coding scheme to annotate each utterance with a meaningful label. This comprehensive qualitative annotation extends through three hierarchical levels to provide a nuanced understanding of SSRL interactions: (1) macrolevel concepts concerning types of interactions for regulation, (2) microlevel concepts focusing on the deliberative characteristics of the interaction, and (3) types of sentences. At the macrolevel, the focus is on categorizing the types of interactions that are primarily regulatory. These interaction types are systematically organized and described in Table 4. This level offers a broad view of how participants engage in different forms of interaction that either facilitate or inhibit effective regulation.
Table 4.
Interaction for regulation | Definition | Example
Cognitive regulation | Coded for interactions that focus on higher learning-related thinking skills such as understanding, analyzing, reasoning, and evaluating, related to problem-solving decisions within the task. | S1: “Which of these would be, well, the pear has 60, but if I could get another one that would have been 60. That one has 30.” S2: “Well, here are the others, here are all the chia seeds, hazelnut spread, and whey protein powder.”
Metacognitive regulation | Coded for higher mental processes at the metacognitive level, including orienting, planning, monitoring, evaluating, and regulating the decision-making process (related to the task goal or collaboration process); potentially includes substantive arguments such as the activation of prior knowledge or generating hypotheses. The connection and reflection may be made at a low level (task-related) or at a higher level of processes, i.e., the collaboration process, group-behavior process, etc. | S1: “Well, no, but OK, now it's 50, 25, 25.” S2: “OK, now how do we get more.” S3: “Let's raise each category a little, so it won't change these ratios.” (Monitor and suggest approach)
Socio-emotional regulation | Coded for the expression of one's emotion in a social context rather than directly related to the task, or in interaction related to the task but with clear evidence of synchronized socio-emotional reactions of group members (i.e., laughing, chuckling, etc.). | S1: “Fewer kilocalories. It doesn't seem to work.” (Annoyance that the solution doesn't work) S2: “Not when you take half of this almond drink.” (Providing reasons) S3: “Well, it's a difficult task.” (Drawing attention to the positive aspects of the challenging situation)
Task execution | Coded for actions and interactions that primarily focus on carrying out task requirements and completing the task; similar to cognitive interaction but without higher-level thinking such as elaborated reasoning, analyzing, and evaluating. | S1: “Yeah, I'll change them to one hundred and twenty-five.” (Inform current process) S2: “One hundred and twenty-five. OK, that should be twenty-five then.” (Specify instruction)
Table 4. Regulatory Code
The microlevel of analysis further refined our understanding by focusing on the deliberative characteristics of each interaction [15, 16]. A detailed account of these deliberative characteristics is available in Table 5. This level of annotation allowed us to isolate and examine the subtle strategies and mechanisms individuals employ during interactions to collectively regulate their learning process.
Table 5.
Define the problem | Understanding the problem, defining the present situation and the desired future to make the current issues or problems clearer to group members.
Establish strategies | Suggestions and implementations of specific process steps (approaches, techniques, or methods to process the tasks or to optimize the cognitive process).
Educate each other | Back-and-forth discussions of group members trying to work out disagreements and align shared understanding by identifying and sharing understanding, information, interests, reasons, needs, and motivations.
Generate options | Brainstorming and generating solutions for task-related problem-solving, offering alternative choices.
Evaluate | Making a judgment about different aspects of the collaboration, including the ideas, outcomes, group focus, and current progress.
Agree and implement | Confirming a shared agreement on the options, ideas, and opinions and implementing it.
Attempt ideas | Applying or testing out alternatives/solutions without forethought and discussion between group members.
Monitor | Observing and checking different aspects of the collaboration, including the time, progress, result, quality of the procedure, environment, and group conditions.
Regulate group emotional monitoring | Interactions with the intention of regulating group focus or emotional motivation about the situation.
Positive socio-emotional interaction | Positive socio-emotional interactions without the intention of regulation.
Negative socio-emotional interaction | Negative/neutral socio-emotional interactions without the intention of regulation.
Table 5. Deliberative Interactions
Lastly, at the base level, each utterance was classified according to the type of sentence used—whether it was a statement, a question, or other sentence types, as shown in Table 6. This basic classification served as a foundational layer that enabled more advanced layers of coding and interpretation.
Table 6.
Sentence type | Description | Example
Affirmative statement | Coded for sentences that provide a positive confirmation or agreement without further content (often comes after a “confirmative question”). | “Yes.” “Oh yeah, that's it.”
Analyzing statement | Coded for sentences that mainly explain or reason about opinions and/or suggestions (usually marked with “if…” in the statement and including quantities). | “Well, not completely, we just take out the banana and mango, and when there's that natural gum…”
Assertive statement | Coded for sentences that communicate a clear belief about an action/next step (often including words like “need to,” “have to,” and “I would say” to express strong opinions, beliefs, or suggestions). | “Honey melon removed.” (Confirming it's done)
Confirmative questions | Coded for yes/no questions or short-answer questions that denote the idea of getting confirmative information. | “And was it still a milk allergy or what was it?”
Declarative statement | Coded for sentences whose main purpose is to predominantly announce one's own action/current process. | “I don't remember how it was.”
Evaluating statement | Similar to analyzing statement but with a good/bad value (measured against certain standards). | “It's not good when the hazelnut spread has natural rubber so high, so all these ingredients should be low. We don't have to change that base, and then, but then, you can always leave.”
Exclamation statement | Predominantly coded for sentences that express emotions or feelings. | “What?!” (Expressing shock)
Filler sentence | Coded for words without much meaning that are often used as conjunctives, transitions, or to denote active listening. | “Yes” (not for the purpose of agreeing on a shared decision, just acknowledgment of hearing the information) “Uhm”
Imperative statement | Coded for commanding or requesting statements that start with a verb or “let's” (often with the intention to guide others' actions or suggest decisions). | “Well, then read it to us.”
Informative statement | Coded for statements (fact stating) that present information supporting the problem-solving or decision-making process related to the task. | “In the ingredients and there's a bar at the bottom, so you move it beam there to the right less.”
Interjecting sentence | Coded when the individual interrupts another mid-sentence or makes an abrupt remark. | “Natural rubber and milk protein, right?” “Whey protein?” (abruptly spoken)
Interrogative questions | Coded for long-answer questions that are often marked with “what,” “how,” and “why” regarding the other's opinion, what they think, ideas, etc. | “Well, we don't have that, well, the ice cream has to be changed to something else.” (disagreeing) “Then what, natural rubber?”
Negative statement | Coded for any sentence that contains not, no, a contraction, a negative adverb, or a negative subjective, on the line of correcting a misunderstanding. | “that, that is not two hundred and fifty grams. That's not even close to 250.”
Rhetorical questions | Coded for questions that don't impose a need for an answer. | “…protein allergy. Does it mean…” (stating a question for the purpose of humor)
Table 6. Sentence Type
Firstly, both annotators underwent extensive training on the annotation framework and engaged in calibration sessions to align their understanding and application of the annotation criteria [7]. Secondly, the annotation process included iterative reviews in which both annotators discussed and resolved discrepancies, thereby enhancing the consistency of their annotations. Thirdly, although two annotators were employed rather than the common practice of at least three, we conducted a commonly used inter-annotator agreement analysis to assess the reliability of the annotations. This analysis was conducted on a 20% sample of our dataset, providing a quantitative measure of consistency between annotators, and the results were within acceptable ranges for qualitative research, indicating a high level of agreement. Such a method aligns with established practices in several research fields, specifically in learning sciences research, as evidenced by its application in highly regarded journal articles, including the study by Järvenoja et al. [35].
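The article does not name the agreement statistic; as one common choice for two annotators, chance-corrected agreement such as Cohen's kappa can be computed on the 20% reliability sample, for example with scikit-learn, as sketched below.

```python
from sklearn.metrics import cohen_kappa_score

def annotation_agreement(labels_a, labels_b) -> float:
    """Chance-corrected agreement between the two annotators on the reliability sample.

    labels_a, labels_b: sequences of categorical codes (e.g., the macrolevel regulation
    types of Table 4) assigned independently to the same 20% sample of utterances.
    """
    return cohen_kappa_score(labels_a, labels_b)

# kappa = annotation_agreement(annotator1_codes, annotator2_codes)
# Values of roughly 0.6-0.8 are conventionally read as substantial agreement.
```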
These annotations provide a structured approach for analyzing verbal interactions across various levels of abstraction, employing different theoretical perspectives from the learning sciences at both group and individual levels. They enable researchers to dissect how communication contributes to both SSRL and self-regulation of learning in group settings. Furthermore, the three-level abstraction of these annotations aids in identifying specific group- or individual-level verbal behaviors that facilitate or hinder effective learning, offering valuable insights for developing targeted interventions to improve collaborative learning outcomes. The macrolevel analysis focuses on forms of interaction and the microlevel analysis deepens this by looking into specific interaction characteristics. These annotation levels help to identify especially group- or peer-level interaction processes. The base level focuses on individual utterances and identifies specific sentence types to reveal individual-level processes, such as how individual learners contribute verbally to the group-level interaction processes. In all, this approach enables a comprehensive examination of interaction processes and mechanisms for learning regulation at different granularities.
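To make the three-level scheme concrete, a hypothetical record structure for one annotated utterance is sketched below; the field names, identifiers, and timing values are illustrative rather than the dataset's actual schema, and the example content is drawn from the metacognitive-regulation excerpt in Table 4.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedUtterance:
    """Illustrative record for one annotated speech turn; all field names are hypothetical."""
    group_id: str
    speaker_id: str
    start_time: float        # seconds from session start (placeholder values below)
    end_time: float
    text_en: str             # validated English translation of the utterance
    regulation_type: str     # macrolevel code (Table 4)
    deliberative_code: str   # microlevel code (Table 5)
    sentence_type: str       # base-level code (Table 6)

example = AnnotatedUtterance(
    group_id="C03", speaker_id="S2", start_time=512.4, end_time=515.1,
    text_en="OK, now how do we get more.",
    regulation_type="Metacognitive regulation",
    deliberative_code="Monitor",
    sentence_type="Interrogative questions",
)
```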

4.2 Facial Expression

Socio-emotional interactions reflect fluctuations in learners' participation in terms of emotional expressions, which enables a comprehensive understanding of collaborative learning processes. Recognizing individuals' emotions through AI-based facial expression recognition within this context is essential, and emotion annotation during triggers serves several vital research purposes [40, 41, 43]. Firstly, it allows for the investigation of how CTs and ETs impact collaboration, revealing their effectiveness in shaping interactions. Secondly, emotion annotation assists in assessing AI's capability to accurately identify facial expressions during interactions, a task that is both time-consuming and labor-intensive for humans. The data 30 seconds before and after every trigger are annotated with three emotion categories: negative, positive, and neutral. The annotation process is conducted in three steps. First, we extract the frames around the CT and ETs and roughly crop the facial regions to make the facial expressions easy to follow. Second, 10 annotators work independently after a preparatory course, and each annotator is required to annotate three trigger clips. Labels are assigned to every trigger clip per second rather than per frame because emotion changes in an evolutionary manner, and a tool is developed to continuously play the frames second by second for annotation. Finally, each trigger clip is annotated by three annotators independently, and the final annotation is the emotion category that receives the most votes among the three annotators.
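A minimal sketch of the majority-vote fusion of the three annotators' per-second labels follows; the tie-breaking rule shown is an assumption, since the article does not specify how disagreements among all three annotators were resolved.

```python
from collections import Counter

def majority_label(labels: list[str]) -> str:
    """Fuse three annotators' labels for one second of a trigger clip by majority vote.

    If all three annotators disagree, fall back to "neutral"; this tie-breaking rule is
    an assumption for illustration only.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "neutral"

# majority_label(["positive", "neutral", "positive"])  ->  "positive"
```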
The distribution of emotions among participants is as follows: positive emotions: \(16.01\%\), negative emotions: \(4.06\%\), neutral emotions: \(79.93\%\). This distribution indicates that the majority of participants’ emotional expressions fall within the “neutral” category, with smaller percentages expressing “positive” and “negative” emotions. This information provides valuable insights into the emotional dynamics of collaborative learning, indicating that the learning environment may generally be characterized by a sense of neutrality and composure among participants.

4.3 Eye Gaze

In the multimodal dataset for interaction, the annotation of eye gaze occupies a critical role in analyzing nonverbal elements of communicative processes. Eye gaze annotation is invaluable for understanding communication dynamics, cognitive processes, and learning engagement [22]. It captures subtle cues, such as gaze shifts and patterns, revealing attentional focus, engagement levels, and comprehension strategies. In our collaborative learning settings, leveraging eye gaze enhances SSRL and group dynamics. Monitoring engagement through gaze patterns allows educators to identify and intervene with disengaged learners, ensuring active participation. Eye gaze cues facilitate smooth turn-taking, fostering equitable involvement and a supportive atmosphere. Encouraging learners to maintain eye contact promotes active listening, enhancing communication and understanding. Gaze signals also support social regulation by conveying interest, agreement, or disagreement, aiding collaborative problem-solving. Integrating gaze feedback provides real-time insights into group dynamics and fosters reflection. Moreover, eye gaze data aid peer assessment, offering objective indicators of participation and engagement for accountability. Overall, utilizing eye gaze in collaborative learning enhances SSRL by promoting engagement, facilitating communication, and supporting social and cognitive regulation processes among learners.
This study utilized a theory-driven manual annotation approach for eye gaze. Experienced coders annotated instances of gaze to identify focus areas and duration, guided by pre-established theoretical frameworks that consider the importance of gaze in signaling attention, cognitive load, or emotional state. The ELAN software was used to segment and code the eye gazes in the multimodal dataset. Inter-coder reliability checks were applied, in which multiple coders annotated the same data independently to ensure consistency. The coded gaze types include partner-oriented gaze, object-oriented gaze, iconic gaze, and others, as shown in Table 7.
Table 7.
Category | Definition
Partner-oriented gaze | The participant is engaged in gaze behavior toward one of their partners
Object-oriented gaze | The participant is engaged in gaze behavior toward an object
Iconic gaze | Hand/arm movements that bear a perceptual relation with concrete entities and events
Other | Any kind of gaze behavior that is not directed to the categories described here
Table 7. Gaze Annotation
These manually annotated eye gaze metrics are integrated into the overall analysis of interactions, providing a more complete view of how individuals interact. By combining this information with verbal and other nonverbal cues, the dataset offers a richer analysis of interactive behavior, making it a valuable resource for studying socially shared regulation in collaborative learning settings.
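As one example of how the manually coded gaze segments can be aggregated into such metrics, the sketch below computes total gaze duration per participant and category; the column layout is a hypothetical ELAN-style export, not the dataset's actual format, and the values are placeholders.

```python
import pandas as pd

# Hypothetical ELAN-style export: one row per coded gaze segment (placeholder values).
gaze = pd.DataFrame({
    "participant": ["S1", "S1", "S2"],
    "category": ["Partner-oriented gaze", "Object-oriented gaze", "Partner-oriented gaze"],
    "start": [12.0, 18.5, 12.3],   # seconds from session start
    "end": [15.2, 25.0, 14.0],
})
gaze["duration"] = gaze["end"] - gaze["start"]

# Total gaze time per participant and gaze category, a simple engagement indicator.
summary = gaze.groupby(["participant", "category"])["duration"].sum()
print(summary)
```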

4.4 Gesture and Posture

Annotating gestures and postures in a collaborative learning setting is a valuable methodology for understanding the nuances of nonverbal communication, engagement, and the learning experience among participants [26, 67]. Gestures refer to body movements or actions that convey a message, express an emotion, or emphasize a point in communication. Four types of gestures, namely deictics, beats, iconics, and metaphorics, are annotated based on McNeill's classification [50] for studying interaction in a collaborative learning setting, as shown in Table 8. These categories are valuable for studying and analyzing interactions in a collaborative learning setting, where gestures play a role in enhancing understanding and engagement.
Table 8.
Category | Definition
Beat gesture | Hand/arm movements that are non-pictorial
Deictic gesture | Open hand and close hand pointing
Iconic gesture | Hand/arm movements that bear a perceptual relation with concrete entities and events
Metaphoric gesture | Abstract content is given form in the imagery of objects, space, movement, and so forth using hand/arm movements
Table 8. Gesture Annotation
Posture is defined as a position of the body or of body parts [49]. Annotations of postural orientation and slumped/upright position are provided for understanding nonverbal communication and human behavior. Specifically, postural orientation indicates the direction of the body or body parts, such as facing to the right or left, while slumped/upright annotations capture the alignment of the body along the vertical axis, from a slouched position to an upright one. These annotations provide valuable data for studying body language and its implications in contexts ranging from psychology and sociology to human-computer interaction.

5 Verification of Dataset Effectiveness

Considering that multimodality and the designed regulatory triggers for the SSRL study are two distinctive characteristics of the proposed dataset compared with previous work, we conducted several preliminary experiments to verify the effectiveness of our dataset and reveal its potential in analyzing human interactions. Specifically, this section evaluates the effectiveness of the designed CT and ET and their impacts on interactions at the individual and group levels across multiple modalities.

5.1 Evaluation on Designed Regulatory Triggers in Terms of Individuals

In this subsection, we investigate the influence and effectiveness of our designed regulatory triggers on individuals in terms of continuous emotion (valence), discrete emotions from facial expressions, gestures, and physiological signals.

5.1.1 Emotional Valence during Different Triggers.

Emotional valence reflects how negative or positive a participant feels, while emotional arousal indicates heightened physiological activity, that is, how strong the emotion is. We calculate the average valence of the participants during different triggers based on the self-reported emotional valences. Figure 6 illustrates the valence for the cognitive trigger (denoted as “CT”) and the three emotional triggers (denoted as “ET”), respectively.
Figure 6.
Figure 6. Valence in average during triggers. CT and ET represent the cognitive trigger and emotional trigger, respectively.
The CT generally aroused the highest valence, with an average score of 6.33, exceeding the highest valence aroused by an ET by 0.2. This indicates that the ET “hurry up” dampens positive emotion compared with the CT. Among the three ETs, the first, “ET1,” achieves the highest valence. From ET1 to the second ET, “ET2,” the valence declines from 6.13 to 6.05, and it remains at 6.05 for the third ET, “ET3.” This suggests that “hurry up” in the ETs imposes pressure on participants: as the ETs accumulate, the pressure or nervousness increases and the valence declines accordingly. However, the decrease is slight, and the valence remains positive throughout. One possible reason is that the participants understood it was just an experiment rather than a real smoothie store. Nevertheless, this pattern verifies that our designed ETs impose pressure on participants.
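For illustration, the averaging behind Figure 6 can be reproduced with a few lines of Python; the DataFrame below uses hypothetical column names and values, not the released file format.

    # Minimal sketch: average self-reported valence (1-9 scale) per trigger.
    import pandas as pd

    reports = pd.DataFrame({
        "participant": ["p1", "p1", "p1", "p1", "p2", "p2", "p2", "p2"],
        "trigger":     ["CT", "ET1", "ET2", "ET3", "CT", "ET1", "ET2", "ET3"],
        "valence":     [7, 6, 6, 6, 6, 6, 6, 5],   # hypothetical values
    })

    mean_valence = reports.groupby("trigger")["valence"].mean()
    print(mean_valence)   # one averaged valence per trigger, as in Figure 6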

5.1.2 Facial Expression Changes around Activated Triggers.

Three fundamental discrete emotion categories are considered to investigate SSRL: positive, negative, and neutral. The emotional change within 1 minute around the triggers is studied to verify the designed triggers’ effectiveness. Specifically, the distribution of facial expressions in the 30 seconds before and after each trigger is calculated to study the effect of the CT and ETs on emotional changes. As shown in Figure 7, there are more positive facial expressions before the activated CT than before the ETs, which is consistent with the valence results for the CT and ETs. It further indicates that the triggers arouse pressure on participants. Moreover, this pressure is believed to further influence interactions.
Figure 7.
Figure 7. Emotion distribution before and after triggers. “\(-\)” and “\(+\)” represent before and after trigger, respectively. CT and ET represent the CT and ET, respectively.
On the other hand, the percentage of positive emotion increases when triggers occur, for both the CT and ETs. Remarkably, the positive facial expression percentage for ET1 is raised by \(25\%\) compared with the facial expressions before the trigger, which is consistent with what was observed during the experiment. This means that the customer’s special request in the CT and “hurry up” in the ETs did not impose much pressure on the participants at the moment they heard them. Besides, the task is not a real smoothie store scene, and the participants are high school teenagers who are lively and prone to laughter [76]; when they hear novel things, they feel interested and are likely to laugh. Moreover, the raised percentage of positive facial expressions declines as the ETs accumulate, indicating that the impact on participants decreases over repeated ETs: as the same kind of ET appears multiple times, the participants get used to it, and the intensity of their responses decreases.
Furthermore, Chi-square results also confirm the significant difference in the emotion distribution around different types of triggers, \({\chi}^2(6, N = 5,940) = 27.52, \rho < 0.001.\) In general, from the valence and discrete emotion changes across the triggers, our designed triggers impact the emotions of the participants. The emotion will further influence the interaction [29, 79]. Positive emotion is likely to lead to positive socio-emotional interactions in collaborative learning [44]. Previous research [84] found that positive emotional interactions are related to better collaboration. It is believed that our study can facilitate collaborative learning research.
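The reported test can be reproduced with SciPy once the frame-level facial expression labels have been tabulated by trigger type; the counts below are hypothetical placeholders with the same 4 x 3 layout (CT, ET1-ET3 by Positive, Neutral, Negative), which yields the six degrees of freedom reported above.

    # Minimal sketch: chi-square test on the emotion-by-trigger contingency table.
    import numpy as np
    from scipy.stats import chi2_contingency

    counts = np.array([
        # Positive, Neutral, Negative  (hypothetical counts)
        [520, 830, 140],   # CT
        [610, 740, 150],   # ET1
        [560, 790, 160],   # ET2
        [530, 800, 110],   # ET3
    ])

    chi2, p, dof, expected = chi2_contingency(counts)
    print(f"chi2({dof}, N={counts.sum()}) = {chi2:.2f}, p = {p:.4f}")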

5.1.3 Facial Expression Recognition Baselines.

This study presents baseline results for facial expression recognition using state-of-the-art methods, including Emotion Face-Alignment Network (EmoFAN) [79] and Multi-task Assisted Correction (MTAC) method [46], which can extract both discrete and continuous emotions from 2D frontal video data.
EmoFAN [79] is built on top of a face-alignment network to predict fiducial landmarks, discrete emotional classes, and continuous affective dimensions of the face in a single pass. EmoFAN provides two models pre-trained on the AffectNet dataset [54]: one with five emotional classes (Neutral, Happy, Sad, Surprise, and Fear) and one with eight emotional classes (Neutral, Happy, Sad, Surprise, Fear, Disgust, Anger, and Contempt), denoted as “EmoFAN5” and “EmoFAN8,” respectively. The pre-trained models are evaluated on the cleaned AffectNet test set, where the accuracies on five and eight emotions are \(82\%\) and \(75\%\), respectively. In terms of the concordance correlation coefficient (CCC) [13], the five-emotion model reaches 0.90 for valence and 0.80 for arousal, and the eight-emotion model reaches 0.82 and 0.75.
MTAC is a novel method of multitask assisted correction for addressing uncertain facial expression recognition [46]. It utilizes a confidence estimation block and a weighted regularization module to suppress uncertain samples and improve robustness. On AffectNet, it achieves \(65.80\%\) accuracy in seven-category classification and CCC values of 0.758 for valence and 0.649 for arousal.
Before analysis, face detection is conducted with Dlib, and face alignment based on 68 facial landmarks is employed to alleviate subjective variation [14]. The three pre-trained models, i.e., EmoFAN5, EmoFAN8, and MTAC, are applied to the annotated facial expressions from video data around the triggers to provide baselines for the emotion recognition task. For the discrete categories, since three emotion categories are considered during annotation, the emotions extracted by the pre-trained models are mapped to Positive, Negative, and Neutral according to [37, 78]: Happy is mapped to Positive, while Sad, Fear, Anger, Contempt, and Surprise are mapped to Negative. Note that Surprise is assigned to the negative category in this article, as participants are expected to show surprise when they hear the trigger instructions; for example, an unexpected instruction (“Hurry up”) imposed on the learning process is likely to elicit negative emotions. The experts’ post-observation during the annotation procedure also supports this assumption.
In the case of discrete categories, EmoFAN5, EmoFAN8, and MTAC achieve accuracies of \(51\%\), \(48\%\), and \(45\%\), respectively. EmoFAN outperforms MTAC, and EmoFAN5 achieves the best performance. One possible reason is that MSSRL is uncontrolled and contains various head poses and occlusions; participants frequently look down at the laptop screen during the learning procedure, leading to significant face deformation. Previous studies have highlighted the challenging nature of facial expression recognition, particularly in real-world scenarios [80, 87]. While existing methods demonstrate excellent performance on curated datasets, their effectiveness tends to diminish in practical settings, mainly due to variations in illumination and pose. EmoFAN, which incorporates the face-alignment task, can alleviate the effects of head pose and occlusion. These results suggest that the collected MSSRL dataset involves substantial head pose and face occlusion factors, providing a challenging collaborative learning scenario for facial expression recognition and other facial analysis tasks. In addition, we compare the predicted valence with the emotion values reported by the participants to validate the continuous emotion estimates. Since participants tend to report their emotions at the highest intensity after hearing a trigger, the predicted valence score at the highest arousal within 10 seconds after the trigger is compared with the self-reported valence. Because each participant reports one value per trigger, the average L1 norm distance is used to evaluate the predicted valence. The distances between predicted and self-reported valence are 0.44, 0.44, and 0.46 for EmoFAN5, EmoFAN8, and MTAC, respectively. These results further indicate that EmoFAN, built on the face-alignment task, is more robust in uncontrolled situations with various head poses and occlusions.
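A minimal evaluation sketch of the mapping and metrics described above is given below, assuming per-frame model predictions are already available; the array contents and the inclusion of Disgust (present in the eight-class AffectNet output but not listed in the mapping above) are assumptions for illustration.

    # Minimal sketch: three-class mapping, accuracy, and mean L1 valence distance.
    import numpy as np

    TO_THREE_CLASS = {
        "Happy": "Positive", "Neutral": "Neutral",
        "Sad": "Negative", "Fear": "Negative", "Anger": "Negative",
        "Contempt": "Negative", "Surprise": "Negative",
        "Disgust": "Negative",   # assumption: treated like the other negatives
    }

    def three_class_accuracy(predicted_emotions, annotated_three_class):
        mapped = [TO_THREE_CLASS[e] for e in predicted_emotions]
        return float(np.mean([m == a for m, a in zip(mapped, annotated_three_class)]))

    def mean_l1_valence(predicted_valence, self_reported_valence):
        # average L1 distance between predicted and self-reported valence per trigger
        return float(np.mean(np.abs(np.asarray(predicted_valence)
                                    - np.asarray(self_reported_valence))))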

5.1.4 Gesture Speed Changes during Triggers.

Similar to facial expressions, body gestures have multiple signaling functions in interactions, and observing human gestures can serve as a diagnostic tool for team members. As a robust, semantically meaningful, and computationally inexpensive representation, body skeleton data are used for studying human gestures. In our dataset MSSRL, the skeleton is extracted using the Azure Kinect devices, as shown in Figure 8. Since engagement plays a vital role in interaction [60] and the speed of motion is an appropriate measure of engagement [51], we measure each subject’s engagement level using gesture speed, defined as the summed movement speed of all joints, i.e., the Euclidean distance that the joints move within each second. Specifically, we compute the joint movement speed using three sets of joints: full-body joints including the head, upper-body joints including the head, and upper-body joints without the head.
Figure 8.
Figure 8. Examples of the skeleton extracted from master and subordinate Kinect, respectively.
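As an illustration of the gesture-speed measure, the sketch below assumes the Kinect skeleton is available as an array of shape (frames, joints, 3) in metres at a known frame rate; the joint-subset indices for the upper-body variants depend on the Azure Kinect joint map and are left to the caller.

    # Minimal sketch: summed joint displacement per second from skeleton data.
    import numpy as np

    def gesture_speed(skeleton, fps=30, joint_ids=None):
        # skeleton: array (frames, joints, 3); joint_ids: optional subset of joints
        joints = skeleton if joint_ids is None else skeleton[:, joint_ids, :]
        step = np.linalg.norm(np.diff(joints, axis=0), axis=-1)   # (frames-1, joints)
        per_frame = step.sum(axis=1)                              # summed over joints
        n_seconds = len(per_frame) // fps
        per_second = per_frame[: n_seconds * fps].reshape(n_seconds, fps).sum(axis=1)
        return per_second                                         # one value per second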
Figure 9 illustrates the average gesture speed in the 1-minute windows before and after the trigger events. The results indicate a tendency for gesture speed to decrease following each trigger event. One explanation is that gestures reflect human cognitive and emotional states [12, 81]; participants may pause to think following the triggering event, which is an interesting observation in the context of this study. In addition, the results indicate that our designed triggers have a significant impact on gestures.
Figure 9.
Figure 9. The average gesture speed difference before and after triggers. CT and ET represent the cognitive trigger and emotional trigger, respectively.

5.1.5 Physiological Data during Triggers.

Physiological measures including EDA and HR have gained attention in the study of learners’ cognitive and emotional processes in collaborative learning, particularly within the domain of SSRL research. EDA, reflecting changes in skin conductance due to sweat gland activity, serves as an indicator of emotional and cognitive arousal [20, 34, 58] which can be integral in understanding the dynamics of collaborative learning. Similarly, HR variability is linked to cognitive and emotional processing [73], offering insights into the physiological underpinnings of group learning processes. In the context of SSRL, these measures can provide objective data on learners’ regulatory processes, complementing self-reports and observational data [48], thus enriching our understanding of the complex interactions within learning groups.
To extract valuable insights from EDA data, we employed a systematic approach using the Ledalab toolbox to compute and label skin conductance responses (SCRs) for the dataset. Initially, we conducted a visual examination of the data to identify any instances where electrode contact may have been compromised. Following this, we addressed minor movement-related distortions within the signal by implementing a Butterworth low-pass filter, specifically setting the frequency at 1 Hz and the filter order at 5. For the detection of SCR peaks, we utilized a threshold of 0.05 microsiemens (µS), applying continuous decomposition analysis to accurately identify these occurrences. Additionally, we separated the EDA signal into its phasic and tonic components, facilitating a more nuanced analysis of the physiological responses under investigation. An example of a learner’s EDA around the regulatory triggers is illustrated in Figure 10. The learner’s physiological activation showed a slight increase after the CT and a pronounced surge following the first ET, further underscoring the effectiveness of the designed triggers.
Figure 10.
Figure 10. Example of a learner’s EDA around the regulatory triggers [33].
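The study itself used the MATLAB-based Ledalab toolbox; as a rough Python approximation of the filtering and peak-detection steps described above, the sketch below applies a 1 Hz, fifth-order Butterworth low-pass filter and searches for SCRs exceeding 0.05 µS. The 4 Hz sampling rate is an assumption for illustration, not a dataset specification.

    # Minimal sketch: EDA low-pass filtering and simple SCR peak detection.
    import numpy as np
    from scipy.signal import butter, filtfilt, find_peaks

    def preprocess_eda(eda, fs=4.0):
        # 5th-order Butterworth low-pass filter at 1 Hz, applied forward-backward
        b, a = butter(N=5, Wn=1.0, btype="low", fs=fs)
        return filtfilt(b, a, eda)

    def detect_scr_peaks(eda_filtered, fs=4.0, threshold=0.05):
        # crude stand-in for continuous decomposition analysis: local maxima
        # whose prominence exceeds 0.05 microsiemens
        peaks, props = find_peaks(eda_filtered, prominence=threshold)
        return peaks / fs, props["prominences"]   # peak times (s) and amplitudes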

5.1.6 Emotional Verbal Interaction during Triggers.

The emotional verbal interactions during triggers have been investigated according to the deliberative interaction annotations introduced in Section 4.1. Given the 3-minute interval between triggers, the verbal interactions within 1 minute before and after each trigger are taken into consideration. Specifically, this examination focuses on positive socio-emotional interactions without the intention of regulation and negative socio-emotional interactions without the intention of regulation.
Consistent with the facial expressions, the duration of positive socio-emotional interaction increases when triggers occur, for both the CT and ETs. Remarkably, the positive socio-emotional interaction for ET2 increases by \(15\%\) compared with that before the trigger. These findings align with the facial expression changes observed around triggers in Section 5.1.2, supporting the notion that positive emotions are likely to lead to positive socio-emotional interactions in collaborative learning [44].

5.2 Evaluation on Designed Regulatory Triggers from Individuals to Groups

In this subsection, we investigate the effectiveness of triggers on collaborative learning groups by analyzing their effects on emotion and gesture speed.
Hypothesis: The CT and ET in collaborative learning environments have significant effects on group members’ cognitive processes and emotional states, which can be reflected by the change in facial expressions and gestures. Specifically, the presence of designed triggers is hypothesized to lead to changes in the facial expressions and behaviors of group members, indicating variations in responses over CT and ET.
\({H_{0}}\): There is no significant difference in the facial expressions and gestures of group members between CT and ETs, indicating constant emotional states irrespective of the trigger (\(\rho > 0.05\)). \({H_{a}}\): There is a significant effect, operationalized as a nonzero regression coefficient for the trigger variable in the model (\(\rho\leq 0.05\)).
In our collaborative learning scenario, the data are multilevel: individuals are nested within groups, and responses are recorded at different times, so assessing variance across these levels is essential. Therefore, multilevel models are employed to analyze emotion and gesture in groups. A multilevel model, also known as a hierarchical linear model or mixed-effects model, is a statistical model for data with a nested or hierarchical structure, allowing variance to be modeled at multiple levels of the hierarchy. Multilevel analysis typically begins with the unconditional, one-way random-effects analysis of variance [53, 65], which estimates the variance structure with a parsimonious, parametric approach. An advantage of including second or higher levels in multilevel modeling is the ability to estimate cross-level effects, offering an alternative research strategy to analyzing each level separately; whether higher levels are needed is typically assessed by computing the intra-class correlation coefficient (ICC) for the dependent variable [72]. ICC scores were calculated using SPSS package version 29.
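As a sketch of the unconditional model used to obtain the ICC (the study itself used SPSS, and its emotion model has three levels), the two-level random-intercept illustration below uses statsmodels; the column names are illustrative, not the dataset’s.

    # Minimal sketch: ICC from a null (intercept-only) random-intercept model.
    import statsmodels.formula.api as smf

    def icc_from_null_model(df, outcome="value", group_col="group"):
        model = smf.mixedlm(f"{outcome} ~ 1", data=df, groups=df[group_col])
        fit = model.fit(reml=True)
        between = float(fit.cov_re.iloc[0, 0])   # group-level variance
        within = float(fit.scale)                # residual variance
        return between / (between + within)

    # An ICC above the chosen cutoff (1% in Section 5.2.1) suggests that the
    # grouping level carries non-negligible variance and should be modeled.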

5.2.1 Facial Expressions in Groups among Triggers.

To evaluate the different effects of the CT and ETs on group emotions within collaborative learning, we initiated our analysis by calculating the ICC. Specifically, we examined how participants’ emotions changed within groups in response to different triggers. The emotions were categorized as 0 (Neutral), 1 (Positive), and 2 (Negative), and the dependent emotion variable’s ICC value is higher than 1%. Therefore, three-level multilevel modeling was used, which provides a useful framework for problems with this type of hierarchical structure: Level 1 (time: participants’ emotions at different trigger times), Level 2 (individuals), and Level 3 (groups in collaborative learning).
Our results show that triggers had a statistically significant (\(\rho_{e}\leq 0.05\)) association with facial expressions (Estimate = \(-\)0.015, SE = 0.006, t(5,939) = \(-\)2.244, \(\rho_{e}\) = 0.025). This illustrates that the various trigger events had a statistically significant impact on facial expressions, further confirming the effectiveness of the designed triggers in influencing group emotions.

5.2.2 Gestures in Groups among Triggers.

To study the association of group gestures with triggers, we utilized multilevel statistical models to explore the relationship between triggers and gesture speed, which can reflect the engagement of the participants [51, 60]. For gestures, two-level multilevel modeling was used since the ICC value at Level 3 is nearly zero. The results demonstrate a statistically significant difference in participants’ gesture speeds among different triggers (Estimate = \(-\)0.167, SE = 0.050, t(4,461) = \(-\)3.365, \(\rho_{g}\) = 0.01), meaning that our designed triggers have significant impacts on gesture speed.
Overall, our designed CT and ET are effective and can influence interactions.

5.3 Evaluation of the Effectiveness of Multimodal Analysis

Apart from the single-modality analyses above, it is necessary to explore whether there are associations between different modalities. To obtain a holistic view of all the trigger moments, we examine continuous emotional states and body movement speeds around each trigger time-point, using Group 6 (G6) as an example for visualization. Figure 11 shows the valence, arousal, and skeleton movement trends of Group 6 in our dataset. For valence, clear changes appear after every trigger, which confirms the results of Section 5.1.3. Arousal is more complicated in this case: the inconsistent variations of each participant across triggers further demonstrate the dynamics and diversity of interactions. The skeleton movement shows trigger-related changes similar to those of valence, and the effect of the same trigger decreases as the number of applications increases. Regardless of the modality or trigger, the trends correspond to the previously observed relations between emotion and skeleton movement. Similar consistency exists for the other groups in the dataset, demonstrating its potential value for analyzing interactions.
Figure 11.
Figure 11. Trend visualization of valence, arousal, and skeleton movement around CT and ET. CT and ET represent the cognitive trigger and emotional trigger, respectively. Zoom in for a better view.
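A minimal plotting sketch for such trigger-aligned trend views is given below, assuming per-second valence, arousal, and skeleton-speed series plus trigger timestamps have already been extracted; all names are placeholders.

    # Minimal sketch: stacked trend plot with trigger moments marked.
    import matplotlib.pyplot as plt

    def plot_trends(t, valence, arousal, speed, trigger_times):
        fig, axes = plt.subplots(3, 1, sharex=True, figsize=(8, 6))
        for ax, series, name in zip(axes, (valence, arousal, speed),
                                    ("valence", "arousal", "skeleton speed")):
            ax.plot(t, series)
            ax.set_ylabel(name)
            for tt in trigger_times:              # mark CT/ET time-points
                ax.axvline(tt, linestyle="--", alpha=0.5)
        axes[-1].set_xlabel("time (s)")
        return fig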

6 Discussion

In this section, we discuss our findings and contributions, and then describe the limitations and future work.

6.1 Contributions to Research on Multimodal Interactions for SSRL

6.1.1 Cognitive and Socio-Emotional Interaction.

Examining cognitive and socio-emotional interactions in the context of SSRL and group regulation in collaborative learning is crucial both for advancing learning sciences theories and for designing effective support for learners in collaborative settings. However, existing datasets for interaction study are not sufficient for developing methods to examine the underlying mechanisms of cognitive and socio-emotional interactions within the context of SSRL. Accordingly, based on the novel concept of trigger events for regulation [33], this study provides a multimodal dataset with designed triggers to regulate emotional and cognitive processes during interactions. This dataset has significant implications for further methodological development and theoretical advancement in researching and understanding the dynamic interaction mechanisms that govern SSRL in collaborative learning.
Our dataset enables the investigation of cognitive and socio-emotional interactions through the lens of strategically designed regulatory triggers that can inform the development of targeted support mechanisms for learners confronted with challenges in collaborative settings. The analysis of cognitive and socio-emotional interactions in relation to regulatory triggers has several far-reaching implications for both theory and practice. First, the identification of cognitive and socio-emotional interactions associated with such triggers can serve as a diagnostic tool for educators and facilitators to predict points of difficulty within collaborative activities. This predictive utility can then be operationalized through educational technology, such as intelligent tutoring systems, to offer timely interventions that guide groups through cognitively or emotionally challenging scenarios [1]. Second, understanding these triggers can enhance the design of collaborative platforms [33]. Systems can be engineered to provide dynamic scaffolding, tailoring assistance based on the type of trigger encountered, whether cognitive or socio-emotional. This leads to more effective support for both self-regulated learning and SSRL [2]. Third, from a pedagogical standpoint, curricula can be designed to include explicit training on recognizing and responding to these triggers, thereby equipping learners with the metacognitive and emotional regulation skills necessary for effective collaboration. Lastly, for researchers, focusing on interactions associated with these triggers offers a refined unit of analysis for investigating the complex interplay between cognitive and socio-emotional processes in collaborative settings.

6.1.2 Multimodalities.

Our dataset provides multimodal data for collaborative learning, including facial videos, audio, physiological signals (EDA, HR, and accelerometer), and Kinect data (RGB, depth, silhouette, and skeleton). Multimodal analysis is essential for studying human interaction as it offers a comprehensive perspective on communication. It enriches data representation by considering various channels such as speech, gestures, and facial expressions, enabling a deeper comprehension of interactions. Moreover, it facilitates contextual understanding by allowing different modalities to complement each other; for instance, nonverbal cues such as facial expressions can reveal the emotional tone of spoken words, while gestures can elucidate the meaning of written text.
Multimodal analysis is highly beneficial in collaborative learning settings, revolutionizing the educational experience [57, 85]. One prominent application is engagement monitoring in online education [8, 28], where digital platforms harness a blend of video feeds and interaction data to gauge student engagement levels. Through the analysis of facial expressions, eye gaze, mouse clicks, and keystrokes, these platforms can discern the moments when students are fully engaged or when their attention wanes. This real-time insight empowers educators to adapt learning content dynamically, offer additional support, or suggest timely breaks to reengage students effectively. Furthermore, peer assessment and feedback in group projects stand to gain significantly from multimodal analysis [23]. Instructors and AI systems can comprehensively evaluate collaboration quality during virtual group meetings by analyzing audio recordings, text chats, and screen sharing. This holistic assessment encompasses factors such as the distribution of speaking time, the quality of discussions, and individual contributions, resulting in more accurate peer evaluations and fairer grading processes. Additionally, multimodal analysis demonstrates its worth in special education, particularly in nurturing social skills development [11]. Educators can provide real-time feedback to students with autism or social challenges by tracking facial expressions, body language, and vocal intonations during social interactions. This approach contributes significantly to enhancing communication skills and fostering the ability to recognize and respond to social cues effectively.

6.2 Interdisciplinary Approach

Another substantial contribution of this study is the interdisciplinary approach, together with preliminary results examining the utility of the proposed dataset. Our results show a significant difference in emotion and gesture changes within groups across the different CT and ET. Our findings support the view that external events dynamically influence students’ learning interactions [73].
Moreover, we provided annotations on verbal interaction, facial expression, gaze, gesture, and posture. These annotations can serve as training data for machine learning models, particularly those related to computer vision and multimodal data analysis. Researchers can use this resource to create and fine-tune AI algorithms for tasks such as emotion recognition, human–computer interaction, and more [42, 47, 86]. The dataset encompasses challenging real-world situations. Challenging scenarios are valuable for testing the robustness and effectiveness of AI algorithms. They can help researchers develop models that perform well in real-world, less-controlled environments.
Our interdisciplinary approach in the present study also responds to recent calls for interdisciplinary efforts bridging learning sciences, sociology, machine learning, and computer science to maximize the impact of multimodal data and advanced techniques in examining and supporting emotional and cognitive processes. This article contributes to the field of computer science by offering a novel dataset for multimodal model development, to the field of learning sciences by providing new insights into the trigger moments for cognitive and emotional processes in collaborative learning, and to the field of sociology by studying how regulatory triggers influence interactions for the development of interactive intelligent systems.

6.3 Limitation and Future Work

Since this is a preliminary study of multimodal analysis for SSRL, several limitations should be addressed in future work. The first concerns annotation. The dataset includes annotations for verbal interactions, gazes, gestures, and postures, which are firmly rooted in theoretical frameworks from both computer sciences and learning sciences, enhancing the multidisciplinary nature of our research. We anticipate that these extensive annotations will enable a more detailed and nuanced comprehension of the data. However, it is important to clarify that these annotations underwent a reliability assessment carried out by two independent coders. Although we were unable to involve a third annotator across the various annotation categories, as recommended, we acknowledge this as a limitation of our study: despite rigorous annotation from the viewpoints of both computer sciences and learning sciences and a carefully executed reliability test with two independent coders, the incorporation of a third independent coder could further enhance the reliability of these annotations.
Besides, due to equipment constraints, we only recorded the Kinect data of the three participants in Group C. Although the Kinect data of Group A and Group B have not been collected, the upper-body gestures of participants in Groups A and B can still be analyzed in the future from the recordings of the \(360^{\circ}\) cameras.
Notably, the primary focus of this work is on the data collation, categorization, description, and validation of dataset effectiveness through the analysis of the designed triggers, rather than the evaluation or proposal of computational models to interpret the data. Therefore, the predictive capabilities of the dataset in terms of qualitative annotations have not been explored in the current study. A crucial avenue for future research involves the establishment of baseline predictions for the qualitative annotations within the dataset. Such baselines would serve as empirical benchmarks for comparing and evaluating the performance of subsequent predictive models. This is a nontrivial task, given the complexities inherent in multimodal interactions, and represents an essential next step for fully leveraging the utility of the dataset in both computational and educational contexts.
Another limitation is that this work only analyzes interactions influenced by triggers. The relationship between emotional and cognitive processes in the whole process should be further explored.

7 Conclusion

This article introduces a novel multimodal dataset specifically designed to study interactions. The dataset contains 81 video clips from individual learners, annotated with three emotion labels around the regulatory trigger events, 28 stitched \(360^{\circ}\) videos of the learning groups, 18 Kinect data streams, and 66 physiological signal recordings. We respond to recent calls to utilize multimodal data and advanced machine learning technologies to reveal the “unobservable” emotional and cognitive processes during interaction. Special regulatory triggers have been designed to further study their influence on emotional and cognitive interaction among multiple people. Furthermore, this article demonstrates an interdisciplinary approach with multimodal data to examine interactions, which would be meaningful for developing human-centered interactive intelligent systems.

Appendix

The specific items in the pre-situated self-reports are shown in Table 9. Table 10 presents the report of the participants’ emotions during the triggers. Moreover, participants are asked to complete post-situated self-reports, as shown in Table 11.
At this time my feelings are | This task appeared | This task appeared | How certain are you that you will reach the objective of this task?
Extremely negative 1 | Boring 1 | Not at all difficult 1 | Not at all certain 1
Very negative 2 | 2 | 2 | 2
Fairly negative 3 | 3 | 3 | 3
Somewhat negative 4 | 4 | 4 | 4
Neutral 5 | Neutral 5 | Neutral 5 | Neutral 5
Somewhat positive 6 | 6 | 6 | 6
Fairly positive 7 | 7 | 7 | 7
Very positive 8 | 8 | 8 | 8
Extremely positive 9 | 9 | 9 | 9
 | Interesting 10 | Extremely difficult 10 | Extremely certain 10
Table 9. The Items in Pre-Self Report
Videos | Questions | Answer (1–9)
Video 1 | My feelings were |
 | My groupmate’s (name) feelings were |
 | My groupmate’s (name) feelings were |
Video 2 | My feelings were |
 | My groupmate’s (name) feelings were |
 | My groupmate’s (name) feelings were |
Video 3 | My feelings were |
 | My groupmate’s (name) feelings were |
 | My groupmate’s (name) feelings were |
Video 4 | My feelings were |
 | My groupmate’s (name) feelings were |
 | My groupmate’s (name) feelings were |
Table 10. The Report of Emotions During the Triggers
Specifically, the answer 1–9 represents valence from extremely negative to extremely positive.
At this time my feelings are | Estimate your mental investment to this task | This task appeared | This task appeared | How certain are you that you have reached the objective of this task?
Extremely negative 1 | Extremely small 1 | Boring 1 | Not at all difficult 1 | Not at all certain 1
Very negative 2 | Very small 2 | 2 | 2 | 2
Fairly negative 3 | Small 3 | 3 | 3 | 3
Somewhat negative 4 | Relatively small 4 | 4 | 4 | 4
Neutral 5 | Neither small nor large 5 | Neutral 5 | Neutral 5 | Neutral 5
Somewhat positive 6 | Relatively large 6 | 6 | 6 | 6
Fairly positive 7 | Large 7 | 7 | 7 | 7
Very positive 8 | Very large 8 | 8 | 8 | 8
Extremely positive 9 | Extremely large 9 | 9 | 9 | 9
 |  | 10 | 10 | 10
Table 11. The Items in Post-Self Report

References

[1]
Vincent Aleven, Bruce M. Mclaren, Jonathan Sewall, and Kenneth R. Koedinger. 2009. A new paradigm for intelligent tutoring systems: Example-tracing tutors. International Journal of Artificial Intelligence in Education 19, 2 (2009), 105–154.
[2]
Roger Azevedo. 2014. Issues in dealing with sequential and temporal characteristics of self-and socially-regulated learning. Metacognition and Learning 9 (2014), 217–228.
[3]
Michal Balazia, Philipp Müller, Ákos Levente Tánczos, August von Liechtenstein, and Francois Bremond. 2022. Bodily behaviors in social interaction: Novel annotations and state-of-the-art evaluation. In Proceedings of the 30th ACM International Conference on Multimedia, 70–79.
[4]
Elizabeth F. Barkley, Claire H. Major, and K. Patricia Cross. 2014. Collaborative Learning Techniques: A Handbook for College Faculty. John Wiley & Sons.
[5]
Cigdem Beyan, Francesca Capozzi, Cristina Becchio, and Vittorio Murino. 2017. Prediction of the leadership style of an emergent leader using audio and visual nonverbal features. IEEE Transactions on Multimedia 20, 2 (2017), 441–456.
[6]
Cigdem Beyan, Nicolo Carissimi, Francesca Capozzi, Sebastiano Vascon, Matteo Bustreo, Antonio Pierro, Cristina Becchio, and Vittorio Murino. 2016. Detecting emergent leader in a meeting environment using nonverbal visual features only. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, 317–324.
[7]
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc.
[8]
Santi Caballé. 2015. Towards a multi-modal emotion-awareness e-Learning system. In Proceedings of the 2015 International Conference on Intelligent Networking and Collaborative Systems. IEEE, 280–287.
[9]
Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, and Michel Valstar. 2017. The NoXi database: Multimodal recordings of mediated novice-expert interactions. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, 350–359.
[10]
Jean Carletta, Simone Ashby, Sebastien Bourban, Mike Flynn, Mael Guillemot, Thomas Hain, Jaroslav Kadlec, Vasilis Karaiskos, Wessel Kraaij, Melissa Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska, Iain McCowan, Wilfried Post, Dennis Reidsma, and Pierre Wellner. 2005. The AMI meeting corpus: A pre-announcement. In Proceedings of the International Workshop on Machine Learning for Multimodal Interaction. Springer, 28–39.
[11]
Jingying Chen, Dan Chen, Xiaoli Li, and Kun Zhang. 2013. Towards improving social communication skills with multimodal sensory information. IEEE Transactions on Industrial Informatics 10, 1 (2013), 323–330.
[12]
R. Breckinridge Church, Martha W. Alibali, and Spencer D. Kelly. 2017. Why Gesture? How the Hands Function in Speaking, Thinking and Communicating, Vol. 7. John Benjamins Publishing Company.
[13]
Sara B. Crawford, Andrzej S. Kosinski, Hung-Mo Lin, John M. Williamson, and Huiman X. Barnhart. 2007. Computer programs for the concordance correlation coefficient. Computer Methods and Programs in Biomedicine 88, 1 (2007), 62–74.
[14]
Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’05), Vol. 1. IEEE, 886–893.
[15]
Belle Dang, Andy Nguyen, and Sanna Järvelä. 2023a. Clustering deliberation sequences through regulatory triggers in collaborative learning. In Proceedings of the IEEE International Conference on Advanced Learning Technologies (ICALT ’23). IEEE, 158–160.
[16]
Belle Dang, Rosanna Vitiello, Andy Nguyen, Carolyn P. Rosé, and Sanna Järvelä. 2023b. How do students deliberate for socially shared regulation in collaborative learning? A process-oriented approach. In Proceedings of the 16th International Conference on Computer-Supported Collaborative Learning (CSCL ’23). International Society of the Learning Sciences, 59–66.
[17]
Jeff DePree, Stanley Su, and Xuelian Xiao. 2009. Event-triggered rule processing in a collaborative learning environment. In Proceedings of the E-Learn: World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education. Association for the Advancement of Computing in Education (AACE), 1580–1586.
[18]
Pierre Dillenbourg. 1999. What do you mean by collaborative learning? In Collaborative-Learning: Cognitive and Computational Approaches, 1–9.
[19]
Muhterem Dindar, Sanna Jarvela, Sara Ahola, Xiaohua Huang, and Guoying Zhao. 2020. Leaders and followers identified by emotional mimicry during collaborative learning: A facial expression recognition study on emotional valence. IEEE Transactions on Affective Computing 13, 3 (2020), 1390–1400.
[20]
Muhterem Dindar, Sanna Järvelä, Andy Nguyen, Eetu Haataja, and Ahsen Cini İricioglu. 2022. Detecting shared physiological arousal events in collaborative problem solving. Contemporary Educational Psychology 69 (2022), 102050.
[21]
Carlos Duarte and António Neto. 2009. Gesture interaction in cooperation scenarios. In Proceedings of the International Conference on Collaboration and Technology. Springer, 190–205.
[22]
T. Andrew Duchowski. 2017. Eye Tracking: Methodology Theory and Practice. Springer.
[23]
Jan Fermelis, Richard Tucker, and Stuart Palmer. 2007. Online self and peer assessment in large, multi-campus, multi-cohort contexts. In Proceedings of the Conference of the Australasian Society for Computers in Learning in Tertiary Education, 271–281.
[24]
Chris Frith. 2009. Role of facial expressions in social interactions. Philosophical Transactions of the Royal Society B: Biological Sciences 364, 1535 (2009), 3453–3458.
[25]
Kurt F. Geisinger. 2016. 21st century skills: What are they and how do we assess them? Applied Measurement in Education 29, 4 (2016), 245–249.
[26]
Joseph F. Grafsgaard, Joseph B. Wiggins, Kristy Elizabeth Boyer, Eric N. Wiebe, and James C. Lester. 2013. Embodied affect in tutorial dialogue: Student gesture and posture. In Proceedings of the Artificial Intelligence in Education: 16th International Conference (AIED ’13) (July 9–13, 2013). Springer, 1–10.
[27]
Allyson Hadwin, Sanna Järvelä, and Mariel Miller. 2018. Self-regulation, co-regulation, and shared regulation in collaborative learning environments. In Handbook of Self-Regulation of Learning and Performance. Routledge, 83–106.
[28]
Yang He and Yuqing Gong. 2022. Improving the quality of online learning: A study on teacher-student interaction based on network multi-modal data analysis. In Proceedings of the 5th International Conference on Big Data and Education, 311–318.
[29]
Nuria Hernández-Sellés, Pablo-César Muñoz-Carril, and Mercedes González-Sanmamed. 2019. Computer-supported collaborative learning: An analysis of the relationship between interaction, emotional support and online collaborative tools. Computers & Education 138 (2019), 1–12.
[30]
Miguel Ángel Herrera-Pavo. 2021. Collaborative learning for virtual higher education. Learning, Culture and Social Interaction 28 (2021), 100437.
[31]
Jaana Isohätälä, Piia Näykki, and Sanna Järvelä. 2020. Cognitive and socio-emotional interaction in collaborative learning: Exploring fluctuations in students’ participation. Scandinavian Journal of Educational Research 64, 6 (2020), 831–851.
[32]
Michel-Pierre Jansen, Khiet P. Truong, Dirk K. J. Heylen, and Deniece S. Nazareth. 2020. Introducing MULAI: A multimodal database of laughter during dyadic interactions. In Proceedings of the 12th Language Resources and Evaluation Conference, 4333–4342.
[33]
Sanna Järvelä, Andy Nguyen, and Allyson Hadwin. 2023a. Human and artificial intelligence collaboration for socially shared regulation in learning. British Journal of Educational Technology 54, 5 (2023), 1057–1076.
[34]
Sanna Järvelä, Andy Nguyen, Eija Vuorenmaa, Jonna Malmberg, and Hanna Järvenoja. 2023b. Predicting regulatory activities for socially shared regulation to optimize collaborative learning. Computers in Human Behavior 144 (2023), 107737.
[35]
Hanna Järvenoja, Sanna Järvelä, and Jonna Malmberg. 2020. Supporting groups’ emotion and motivation regulation during collaborative learning. Learning and Instruction 70 (2020), 101090.
[36]
Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. 2011. Deap: A database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing 3, 1 (2011), 18–31.
[37]
Shirli Kopelman, Ashleigh Shelby Rosette, and Leigh Thompson. 2006. The three faces of Eve: Strategic displays of positive, negative, and neutral emotions in negotiations. Organizational Behavior and Human Decision Processes 99, 1 (2006), 81–101.
[38]
Karel Kreijns, Paul A. Kirschner, and Wim Jochems. 2003. Identifying the pitfalls for social interaction in computer-supported collaborative learning environments: A review of the research. Computers in Human Behavior 19, 3 (2003), 335–353.
[39]
Susanne P. Lajoie. 2005. Extending the scaffolding metaphor. Instructional Science 33 (2005), 541–557.
[40]
Shan Li and Weihong Deng. 2020. Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing 13, 3 (2020), 1195–1215.
[41]
Yante Li, Wei Peng, and Guoying Zhao. 2021. Micro-expression action unit detection with dual-view attentive similarity-preserving knowledge distillation. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG ’21). IEEE, 1–8.
[42]
Zongmin Li, Yante Li, Yongbiao Gao, and Yujie Liu. 2016. Fast cross-scenario clothing retrieval based on indexing deep features. In Proceedings of the Advances in Multimedia Information Processing-PCM 2016: 17th Pacific-Rim Conference on Multimedia, Part I (September 15–16, 2016). Springer, 107–118.
[43]
Zongmin Li, Weiwei Tian, Yante Li, Zhenzhong Kuang, and Yujie Liu. 2015. A more effective method for image representation: Topic model based on latent dirichlet allocation. In Proceedings of the 2015 14th International Conference on Computer-Aided Design and Computer Graphics (CAD/Graphics). IEEE, 143–148.
[44]
Lisa Linnenbrink-Garcia, Toni Kempler Rogat, and Kristin L. K. Koskey. 2011. Affect and engagement during small group instruction. Contemporary Educational Psychology 36, 1 (2011), 13–24.
[45]
Xin Liu, Henglin Shi, Haoyu Chen, Zitong Yu, Xiaobai Li, and Guoying Zhao. 2021. iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10631–10642.
[46]
Yang Liu, Xingming Zhang, Janne Kauttonen, and Guoying Zhao. 2024. Uncertain facial expression recognition via multi-task assisted correction. IEEE Transactions on Multimedia 26 (2023), 2531–2543.
[47]
Yang Liu, Xingming Zhang, Yante Li, Jinzhao Zhou, Xin Li, and Guoying Zhao. 2023. Graph-based facial affect analysis: A review. IEEE Transactions on Affective Computing 14, 4 (2022), 2657–2677.
[48]
Jonna Malmberg, Sanna Järvelä, and Hanna Järvenoja. 2017. Capturing temporal and sequential patterns of self-, co-, and socially shared regulation in the context of collaborative learning. Contemporary Educational Psychology 49 (2017), 160–174.
[49]
David Ed Matsumoto, Hyisung C. Hwang, and Mark G. Frank. 2016. APA Handbook of Nonverbal Communication. American Psychological Association.
[50]
David McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press.
[51]
Marek P. Michalowski, Selma Sabanovic, and Reid Simmons. 2006. A spatial model of engagement for a social robot. In Proceedings of the 9th IEEE International Workshop on Advanced Motion Control. IEEE, 762–767.
[52]
Juan Abdon Miranda-Correa, Mojtaba Khomami Abadi, Nicu Sebe, and Ioannis Patras. 2018. AMIGOS: A dataset for affect, personality and mood research on individuals and groups. IEEE Transactions on Affective Computing 12, 2 (2018), 479–493.
[53]
Ebrahim Mohammadpour. 2013. A three-level multilevel analysis of Singaporean eighth-graders science achievement. Learning and Individual Differences 26 (2013), 212–220.
[54]
Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. 2017. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10, 1 (2017), 18–31.
[55]
Philipp Müller, Michael Xuelin Huang, and Andreas Bulling. 2018. Detecting low rapport during natural interactions in small groups from non-verbal behaviour. In Proceedings of the 23rd International Conference on Intelligent User Interfaces, 153–164.
[56]
Philipp Matthias Muller and Andreas Bulling. 2019. Emergent leadership detection across datasets. In Proceedings of the 2019 International Conference on Multimodal Interaction, 274–278.
[57]
Jauwairia Nasir, Aditi Kothiyal, Barbara Bruno, and Pierre Dillenbourg. 2021. Many are the ways to learn identifying multi-modal behavioral profiles of collaborative learning in constructivist activities. International Journal of Computer-Supported Collaborative Learning 16, 4 (2021), 485–523.
[58]
Andy Nguyen, Sanna Järvelä, Carolyn Rosé, Hanna Järvenoja, and Jonna Malmberg. 2023. Examining socially shared regulation and shared physiological arousal events with multimodal learning analytics. British Journal of Educational Technology 54, 1 (2023), 293–312.
[59]
Andy Nguyen, Sanna Järvelä, Yang Wang, and Carolyn Róse. 2022. Exploring socially shared regulation with an AI deep learning approach using multimodal data. In Proceedings of International Conferences of Learning Sciences (ICLS), 527–534.
[60]
Tuan Dinh Nguyen, Marisa Cannata, and Jason Miller. 2018. Understanding student behavioral engagement: Importance of student interaction with peers and teachers. The Journal of Educational Research 111, 2 (2018), 163–174.
[61]
Cristina Palmero, Javier Selva, Sorina Smeureanu, Julio Junior, CS Jacques, Albert Clapés, Alexa Moseguí, Zejian Zhang, David Gallardo, Georgina Guilera, David Leiva, and Sergio Escalera. 2021. Context-aware personality inference in dyadic scenarios: Introducing the UDIVA dataset. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 1–12.
[62]
Isabella Poggi and Francesca D’Errico. 2011. Social signals: A psychological perspective. In Computer Analysis of Human Behavior. Springer, 185–225.
[63]
Francis Quek, David McNeill, Robert Bryll, Susan Duncan, Xin-Feng Ma, Cemil Kirbas, Karl E. McCullough, and Rashid Ansari. 2002. Multimodal human discourse: Gesture and speech. ACM Transactions on Computer-Human Interaction (TOCHI) 9, 3 (2002), 171–193.
[64]
Muhammad Asif Qureshi, Asadullah Khaskheli, Jawaid Ahmed Qureshi, Syed Ali Raza, and Sara Qamar Yousufi. 2021. Factors affecting students’ learning performance through collaborative learning and engagement. Interactive Learning Environments 31, 4 (2023), 2371–2391.
[65]
Stephen W. Raudenbush and Anthony S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods, Vol. 1. Sage.
[66]
Protection Regulation. 2018. General data protection regulation. Intouch 25 (2018), 1–5.
[67]
Joseph M. Reilly, Milan Ravenell, and Bertrand Schneider. 2018. Exploring collaboration using motion sensors and multi-modal learning analytics. In Paper Presented at the International Conference on Educational Data Mining (EDM). International Educational Data Mining Society.
[68]
Rikki Rimor, Yigal Rosen, and Kefaya Naser. 2010. Complexity of social interactions in collaborative learning: The case of online database environment. Interdisciplinary Journal of E-Learning and Learning Objects 6, 1 (2010), 355–365.
[69]
Michael S. Ryoo and Jake K. Aggarwal. 2009. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1593–1600.
[70]
Michael S. Ryoo and Jake K. Aggarwal. 2010. UT-Interaction Dataset, ICPR Contest on Semantic Description of Human Activities (SDHA). Retrieved from http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html
[71]
Dairazalia Sanchez-Cortes, Oya Aran, Dinesh Babu Jayagopi, Marianne Schmid Mast, and Daniel Gatica-Perez. 2013. Emergent leaders through looking and speaking: From audio-visual data to multimodal recognition. Journal on Multimodal User Interfaces 7 (2013), 39–53.
[72]
Patrick E. Shrout and Joseph L. Fleiss. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 86, 2 (1979), 420.
[73]
Márta Sobocinski, Jonna Malmberg, and Sanna Järvelä. 2021. Exploring adaptation in socially-shared regulation of learning using video and heart rate data. Technology, Knowledge and Learning 27, 2 (2022), 385–404.
[74]
Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. 2011. A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing 3, 1 (2011), 42–55.
[75]
Amy Soller. 2001. Supporting social interaction in an intelligent collaborative learning system. International Journal of Artificial Intelligence in Education 12, 1 (2001), 40–62.
[76]
Laurence D. Steinberg. 2014. Age of Opportunity: Lessons from the New Science of Adolescence. Houghton Mifflin Harcourt.
[77]
Tanya Stivers and Jack Sidnell. 2005. Introduction: Multimodal interaction. Semiotica 2005, 156 (2005), 1–20.
[78]
John D. Teasdale and M. Louise Russell. 1983. Differential effects of induced mood on the recall of positive, negative and neutral words. British Journal of Clinical Psychology 22, 3 (1983), 163–171.
[79]
Antoine Toisoul, Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, and Maja Pantic. 2021. Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nature Machine Intelligence 3, 1 (2021), 42–50.
[80]
Gianmarco Ipinze Tutuianu, Yang Liu, Ari Alamäki, and Janne Kauttonen. 2023. Benchmarking deep facial expression recognition: An extensive protocol with balanced dataset in the wild. arXiv:2311.02910.
[81]
Kartik Vermun, Mohit Senapaty, Anindhya Sankhla, Priyadarshi Patnaik, and Arobinda Routray. 2013. Gesture-based affective and cognitive states recognition using Kinect for effective feedback during e-learning. In Proceedings of the 2013 IEEE 5th International Conference on Technology for Education (t4e ’13). IEEE, 107–110.
[82]
Simone Volet, Mark Summers, and Joanne Thurman. 2009. High-level co-regulation in collaborative learning: How does it emerge and how is it sustained? Learning and Instruction 19, 2 (2009), 128–143.
[83]
Eija Vuorenmaa, Sanna Järvelä, Muhterem Dindar, and Hanna Järvenoja. 2023. Sequential patterns in social interaction states for regulation in collaborative learning. Small Group Research 54, 4 (2023), 512–550.
[84]
Noreen M. Webb, Marsha Ing, Nicole Kersting, and Kariane Mari Nemer. 2013. Help seeking in cooperative learning groups. In Help Seeking in Academic Settings. Routledge, 56–99.
[85]
Bin Xie, Joseph M. Reilly, Yong Li Dich, and Bertrand Schneider. 2018. Augmenting qualitative analyses of collaborative learning groups through multi-modal sensing. In Proceedings of the Rethinking Learning in the Digital Age: Making the Learning Sciences Count, 13th International Conference of the Learning Sciences (ICLS), Vol. 1. J. Kay and R. Luckin (Eds.). International Society of the Learning Sciences, Inc.
[86]
Wei Xu. 2019. Toward human-centered AI: A perspective from human-computer interaction. Interactions 26, 4 (2019), 42–46.
[87]
Fanglei Xue, Qiangchang Wang, Zichang Tan, Zhongsong Ma, and Guodong Guo. 2023. Vision transformer with attentive pooling for robust facial expression recognition. IEEE Transactions on Affective Computing 14, 4 (2023), 3244–3256.
