1 Introduction
Peer feedback (PF) is a key component of online education [21], especially in Computer-Supported Collaborative Learning (CSCL) settings [12], and is predictive of learning performance in educational settings [30]. When working in CSCL settings, students generate large volumes of PF data (largely in the form of text entries) that could be used not only to improve our understanding of their learning processes in these settings but also to provide improved in-time support to the learner. This is important since the specificity or quality of PF affects the level of learning improvement [21]. Such quality can be improved by instructional support based on enhanced insights into the nature of students' PF. In earlier research exploring the nature of PF, scholars have largely classified student PF data using qualitative coding [17, 43], which is a very time-consuming task that is not feasible for many stakeholders, including researchers and teachers.
Because PF activities are increasingly practiced in large courses (with > 100 students [43]), the analysis of student PF data needs to be automated (e.g., [22]) to better support teachers, instructors, and examiners in identifying areas in which learners could improve throughout the course as well as after it and, ultimately, to provide support that is conducive to students' learning. Exploring Natural Language Processing (NLP) techniques could address this gap, as NLP has previously been effective in identifying patterns in large datasets in educational contexts [16, 22].
Previous work has analyzed instructor feedback [8, 10]. However, it is unclear which NLP model would show the best results when processing and interpreting PF data [21]. Specifically, grasping the nature of PF messages to understand what the feedback was aimed at (i.e., its meaning) can be challenging. Therefore, there is a need for more research on, and a better understanding of, how NLP models can identify the nature of student PF in text entries in CSCL settings [12].
To address the above-mentioned gap, this study investigates the effectiveness of six selected NLP models in automating the classification of student PF. In particular, we examine how accurately these models identify the different aspects of the nature of students' feedback in peer evaluation. For this, we compare two traditional algorithms (Decision Tree and Multilayer Perceptron, MLP), three transformer-based models (BERT, RoBERTa, and DistilBERT), which have demonstrated state-of-the-art performance in several tasks [10], and GPT-4o, a multimodal large language model [48]. Moreover, to gain a better understanding of how an NLP model makes its decisions, we implement Explainable Artificial Intelligence (XAI) [2], which aims to explain why a given model makes different decisions and to identify potential biases and errors.
3 Method
This research has been approved by an internal institutional ethical committee (Nr: V-2021-0392).
3.1 Dataset
The dataset used in this study consists of PF messages collected from a large project-based university course in Europe. It involves data from 231 students who collaboratively worked on their projects in groups of four to six students. In total, the dataset contains 2,444 feedback messages written in English, which were divided into 10,319 sentences. These sentences were derived from peer assessments that were part of both formative and summative evaluations. Overall, each student provided anonymous feedback to four to five peers in their group, summing up to between eight and twelve feedback texts throughout the course. These messages were then reviewed by the course teaching assistants and the course examiner to assess individual student contributions. The CLASS system enabled all these PF interactions (see [6]).
Once collected, the data was cleaned and annotated using an exploratory approach based on the Boyatzis framework [3]. This approach allowed feedback categories to emerge naturally from the text data through an annotation process that involved four main steps. First, two annotators analyzed a small subset of the data (3%) and identified six categories: management, suggestions for improvement, interpersonal factors, cognition, affect, and miscellaneous, which align with established taxonomies in CSCL and feedback research [5, 43]. Second, a sample of 500 feedback sentences was coded separately by the two annotators, who achieved a moderate level of agreement (Cohen's kappa = 0.65). Third, discrepancies were discussed and re-annotated, with a final Cohen's kappa value of 0.86, indicating strong agreement among raters. Lastly, the remaining dataset was coded by the annotators independently. Table 1 presents examples from this dataset along with a description of each feedback class and the number of samples per class.
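For illustration, the inter-annotator agreement can be computed with scikit-learn's kappa implementation as sketched below; the two label lists are hypothetical placeholders, not annotations from our dataset.

```python
# Minimal sketch of the inter-annotator agreement check; the label lists
# below are hypothetical placeholders, not actual annotations from the dataset.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["management", "affect", "cognition", "management", "affect"]
annotator_2 = ["management", "affect", "suggestions", "management", "affect"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_1, annotator_2):.2f}")
```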
3.2 Model Selection and Evaluation
Six models were implemented: two traditional and four transformer-based, including GPT-4o. For developing both the traditional and transformer models, this study utilized an NVIDIA P100 GPU with 16 GB of memory; all experiments were performed on the Kaggle platform.
The traditional models used in this study were MLP and Decision Tree, both implemented using the Python library Scikit-learn with default parameter settings. These models were selected due to their effectiveness in previous text classification tasks across various studies [9, 16]. To represent the input data, we employed TF-IDF (Term Frequency-Inverse Document Frequency) [32] as the feature extraction method, commonly used in text analysis to weigh the importance of words within the dataset [16].
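A minimal sketch of this baseline pipeline is shown below; the tiny corpus and labels are purely hypothetical stand-ins for the annotated sentences.

```python
# Sketch of the traditional baselines: TF-IDF features fed into a Decision Tree
# and an MLP, both left at scikit-learn's default parameters as in the study.
# The tiny corpus below is a hypothetical placeholder for the annotated data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

sentences = ["Great job on the report.",
             "Maybe you could update the project plan more often.",
             "I really appreciated your support during the meeting.",
             "The theory section needs a clearer argument."]
labels = ["affect", "management", "affect", "cognition"]

vectorizer = TfidfVectorizer()            # weighs words by TF-IDF
X = vectorizer.fit_transform(sentences)

for clf in (DecisionTreeClassifier(), MLPClassifier()):
    clf.fit(X, labels)
    new_sentence = vectorizer.transform(["Thanks for helping me with the slides."])
    print(type(clf).__name__, clf.predict(new_sentence))
```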
The transformer-based models utilized in this study were BERT, RoBERTa, and DistilBERT, all implemented using the SimpleTransformers library, which is built upon the widely adopted Hugging Face Transformers framework. These models were configured with their default parameter settings, ensuring consistency with standard practices in transformer-based natural language processing tasks while leveraging their robust pre-trained architectures for effective text classification.
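A minimal sketch of this setup with SimpleTransformers is given below; the train_df and eval_df DataFrames (with "text" and integer "labels" columns, following the library's expected format) are assumed to hold the annotated sentences and are not constructed here.

```python
# Sketch of fine-tuning RoBERTa via SimpleTransformers with default settings.
# train_sentences, train_label_ids, test_sentences, test_label_ids are assumed
# to hold the annotated data with integer class ids 0-5 (hypothetical names).
import pandas as pd
from simpletransformers.classification import ClassificationModel

train_df = pd.DataFrame({"text": train_sentences, "labels": train_label_ids})
eval_df = pd.DataFrame({"text": test_sentences, "labels": test_label_ids})

model = ClassificationModel("roberta", "roberta-base", num_labels=6)
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

# Swapping the first two arguments for ("bert", "bert-base-uncased") or
# ("distilbert", "distilbert-base-uncased") gives the other two baselines.
```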
We also integrated a GPT model to explore the potential of large-scale language models in this context. We adopted the gpt-4o-2024-08-06 model through the OpenAI API, which was the most advanced version available. It is well known that the design of prompts plays a key role in a model's capacity to generate accurate and relevant results [46]. In our study, the prompts were carefully tailored to the specific context of the task, incorporating clear and direct instructions and specifying the desired output format [33]. Table 2 presents the final prompt used in this study.
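The sketch below shows how a single sentence could be classified with this model through the OpenAI Python client; the system prompt shown is an abridged stand-in, not the actual prompt from Table 2.

```python
# Sketch of classifying one PF sentence with gpt-4o-2024-08-06 via the OpenAI
# API; the system prompt is an abridged stand-in for the prompt in Table 2.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def classify_sentence(sentence: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Classify the following peer-feedback sentence into one of: "
                "management, suggestions for improvement, interpersonal factors, "
                "cognition, affect, miscellaneous. Answer with the category only.")},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip()

print(classify_sentence("Maybe you could update the project plan more often."))
```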
To answer the first research question, we split the dataset into a training set (80%) and a test set (20%) to facilitate the evaluation process. To assess the models' performance, we employed widely recognized metrics, including accuracy and Cohen's kappa, which are commonly used in similar studies to compare classification effectiveness in the educational domain [16].
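A sketch of this evaluation protocol is shown below; `sentences`, `labels`, and `classifier` are placeholders for the annotated dataset and any of the fitted models above.

```python
# Sketch of the 80/20 split and the two reported metrics; `sentences`,
# `labels`, and `classifier` are placeholders (hypothetical names).
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.20, random_state=42)

classifier.fit(X_train, y_train)       # train on the 80% portion
y_pred = classifier.predict(X_test)    # predict on the held-out 20%

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
```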
To answer the second research question, we used the Python LIME library to implement LIME for interpretability. This study applied LIME to the top-performing model to provide insights into its decision-making process. A subset of the dataset was selected, allowing LIME to explain the model's predictions by highlighting the specific words that influenced the classification. LIME assigns weights to these words, indicating their contribution to determining whether a sentence belongs to a particular class, thus offering a clearer understanding of the factors driving the model's predictions.
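A minimal sketch of this step is given below, assuming the fitted SimpleTransformers RoBERTa model from the earlier sketch; the softmax wrapper that turns raw logits into class probabilities is our assumption about how to expose the model to LIME.

```python
# Sketch of explaining one RoBERTa prediction with LIME. `model` is the fitted
# SimpleTransformers model from the sketch above; the softmax wrapper converting
# raw logits into class probabilities is an assumption of this illustration.
import numpy as np
from scipy.special import softmax
from lime.lime_text import LimeTextExplainer

class_names = ["management", "suggestions for improvement", "interpersonal factors",
               "cognition", "affect", "miscellaneous"]

def predict_proba(texts):
    _, raw_outputs = model.predict(list(texts))
    return softmax(np.array(raw_outputs), axis=1)

explainer = LimeTextExplainer(class_names=class_names)
sentence = "Maybe you could update the project plan more often."
explanation = explainer.explain_instance(sentence, predict_proba,
                                         num_features=6, top_labels=1)
top = explanation.available_labels()[0]
print(class_names[top], explanation.as_list(label=top))  # (word, weight) pairs
```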
5 Discussion
This study aimed to improve our understanding of how the classification of student PF text data could be automated (through machine learning) to scale up related learning analytics interventions and support (e.g., adaptive scaffolding of students) in CSCL settings. We examined the performance of six machine learning models in classifying PF text messages from students in a large CSCL course (n = 231 students) in higher education. We extend previous work that uses NLP to automatically classify student PF text data, with a focus on the student, to better understand the nature of their feedback to peers. We found that the transformer-based models (BERT, RoBERTa, and DistilBERT) outperformed the traditional models MLP and Decision Tree as well as GPT-4o. The latter finding is in line with recent results by Hutt et al. [22], who similarly showed that ChatGPT is outperformed by classical NLP techniques in the automated detection of (peer) feedback quality. In the future, different prompting strategies, which have been found to be predictive of the quality of LLM-generated outputs [25], and other LLMs should be examined in similar settings and beyond.
The results of the present study build on the findings by [22] by presenting another effective way of using NLP to classify student PF text data. The traditional models probably struggled to perform as well as the transformer-based models because they lacked pre-training. RoBERTa achieved an accuracy of 73.1% and a kappa score of 0.61, indicating moderate agreement [29]. In that sense, RoBERTa's performance can be regarded as reliable, since the study by Castro et al. [6] achieved the highest F1-score of 0.58 and a Cohen's κ of 0.43 with CRF. Moreover, this aligns with previous work on classifying feedback messages, in which BERT variants achieved better results [10]. This finding has the potential to scale up the automated coding of student PF text data for improved learner and teacher support based on LA interventions. However, before scaling this up, we call for replication studies with datasets from similar educational contexts. Furthermore, the application of RoBERTa and other transformer-based ML models to datasets in languages other than English has to be further explored [16].
Further, this study aimed to improve our understanding of the role of XAI, namely LIME [45], in understanding the decision-making processes of the best-performing model. The LIME method was applied to analyze RoBERTa's predictions on two contexts selected from the dataset. The most predictive words highlighted by XAI are aligned with the theory behind the classes we evaluated [44].
A practical implication of these findings is that RoBERTa has the potential to accurately identify the nature of PF text data, providing useful insights for educators and researchers in the field. Therefore, this model can be used to aid examiners, teachers, and instructors in analyzing and identifying the nature of PF, and potentially its relation to performance, to support future learning processes within that learning environment. A specific implication for using NLP models to analyze PF text data is to take into account whether stopwords are present in the text data. By using XAI to analyze three different contexts, we determined that the consideration of specific stopwords is a crucial factor in RoBERTa's decision making.
For instance, one might build on our findings to design a human-in-the-loop approach that could enhance the interpretability and reliability of the model's outputs. Involving educators or domain specialists in the data analysis process would allow for an iterative refinement of the model's outputs, ensuring that its insights align with stakeholders' needs and educational contexts. Additionally, this approach could help fine-tune the model by integrating automated analysis and human expertise. As a result, implementing such a human-in-the-loop strategy could lead to a deeper understanding of the relationship between language patterns in PF text messages and educational outcomes, ultimately contributing to more personalized and effective learning interventions and improved learning outcomes.
In that context, LIME is well suited to foster transparency and trust. Because LIME makes RoBERTa's individual predictions understandable by highlighting the components of text that most influence the model's outcomes, it helps build trust with users who rely on these insights for making informed decisions. By highlighting how specific features, such as stopwords, impact the analysis, LIME supports educators in validating and refining the model's predictions. Consequently, this would enhance the credibility of automated feedback systems and encourage their integration into educational practices.
This paper has several limitations. First, the dataset used in this study was unbalanced, which means the models might favor the classes with the most instances. There are several methods to deal with data imbalance, such as undersampling the classes that have the most data and oversampling the classes that have less data so that all classes have the same number of labels. Additionally, the default hyperparameters were used for all models, and the models' performance could be improved if the hyperparameters were tuned; a variety of tuning methods can be used to improve the final performance of the models [37]. Furthermore, the GPT result could also be improved by incorporating additional elements, such as examples (few-shot prompts). Lastly, the data was split so that 80% was used for training and 20% for testing. The proportions used for training and testing can be changed; however, the model's performance may vary. Therefore, methods such as cross-validation can be used, in which different subsets of the data are rotated between training and testing (see the sketch below). Similarly, the transformer-based models could be further investigated in terms of different architectures and training procedures. Finally, finding a model that has already been pre-trained on peer assessment feedback or similar data could be valuable in understanding how applicable it is to the dataset used in this study and similar ones.
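As an illustration of the cross-validation alternative mentioned above, the sketch below runs a stratified 3-fold evaluation of the TF-IDF + Decision Tree baseline; the sentences and labels are toy placeholders, not entries from our dataset.

```python
# Sketch of stratified k-fold cross-validation for the TF-IDF + Decision Tree
# baseline; the six sentences and labels are toy placeholders, not real data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

sentences = ["Great job on the report.", "Please update the project plan.",
             "Thanks for your support.", "You should document the code better.",
             "I enjoyed working with you.", "Try to attend the weekly meetings."]
labels = ["affect", "management", "affect", "management", "affect", "management"]

pipeline = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier())
scores = cross_val_score(pipeline, sentences, labels,
                         cv=StratifiedKFold(n_splits=3), scoring="accuracy")
print("Accuracy per fold:", scores)
```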
6 Conclusion
The strong performance of transformer-based models, particularly RoBERTa, suggests a shift in the way natural language processing tasks such as peer feedback classification can be approached in CSCL educational settings. By outperforming traditional models and prior approaches, this study demonstrates the potential of transformers to enhance learning analytics and AI-informed applications in education, particularly when analyzing nuanced, unstructured text data such as that examined in this study. This could pave the way for more accurate automated feedback systems, better peer interaction analysis, and improved instructional design in collaborative learning environments. Overall, the use of transformer models, especially RoBERTa, along with explainable AI, in classifying peer feedback in CSCL settings not only surpasses traditional models but also sets a new benchmark in terms of accuracy and reliability, marking a notable improvement in educational natural language processing applications.