DOI: 10.1145/3706468.3706526 · LAK Conference Proceedings · Short paper · Open access

That's What RoBERTa Said: Explainable Classification of Peer Feedback

Published: 03 March 2025

Abstract

Peer feedback (PF) is essential for improving student learning outcomes, particularly in Computer-Supported Collaborative Learning (CSCL) settings. When digital tools are used for PF practices, student data (e.g., PF text entries) is generated automatically. Analyzing these large datasets can enhance our understanding of how students learn and help improve their learning. However, manually processing such large datasets is time-intensive, highlighting the need for automation. This study investigates the use of six machine learning models to classify PF messages from 231 students in a large university course. The models include Multi-Layer Perceptron (MLP), Decision Tree, BERT, RoBERTa, DistilBERT, and GPT-4o. The models were evaluated based on accuracy and Cohen’s kappa. Preprocessing involved removing stop words, and the impact of this step on model performance was assessed. Results showed that only the Decision Tree model improved with stop-word removal, while performance decreased for the other models. RoBERTa consistently outperformed the others across all metrics. Explainable AI was used to understand RoBERTa’s decisions by identifying the most predictive words. This study contributes to the automatic classification of peer feedback, which is crucial for scaling learning analytics efforts that aim to provide better in-time support to students in CSCL settings.

1 Introduction

Peer feedback (PF) is a key component of online education [21], especially in Computer-Supported Collaborative Learning (CSCL) settings [12], and is predictive of learning performance in educational settings [30]. When working in CSCL settings, students generate large volumes of PF data (largely in the form of text entries) that could be used not only to improve our understanding of their learning processes but also to provide improved in-time support to the learner. This is important since the specificity, or quality, of PF affects the level of learning improvement [21]. Such quality can be improved by instructional support based on enhanced insights into the nature of students’ PF. In earlier research exploring the nature of PF, scholars have largely classified student PF data using qualitative coding [17, 43], a very time-consuming task that is not feasible for many stakeholders, including researchers and teachers.
Because PF activities are increasingly practiced in large courses (with > 100 students [43]), the analysis of student PF data needs to be automated (e.g., [22]) to better support teachers, instructors, and examiners in identifying areas in which learners could improve throughout the course and after it, and, ultimately, to provide support that is conducive to students’ learning. Natural Language Processing (NLP) techniques could address this gap, as NLP has previously been effective in identifying patterns in large datasets in educational contexts [16, 22].
Previous work has analyzed instructor feedback [8, 10]. However, it is unclear which NLP model shows the best results when processing and interpreting PF data [21]. Specifically, grasping the nature of PF messages to understand what the feedback was aimed at (i.e., its meaning) can be challenging. Therefore, more research is needed to better understand how NLP models can identify the nature of student PF in text entries in CSCL settings [12].
To address the above-mentioned gap, this study investigates the effectiveness of six selected NLP models in automating the classification of student PF. In particular, we examine how accurately these models identify the different aspects of the nature of students’ feedback in peer evaluation. For this, we compare two traditional algorithms (Decision Tree and Multi-Layer Perceptron, MLP), three transformer-based models (BERT, RoBERTa, and DistilBERT), which have demonstrated state-of-the-art performance in several tasks [10], and GPT-4o, a multimodal large language model [48]. Moreover, to gain a better understanding of how the NLP model makes its decisions, we implement Explainable Artificial Intelligence (XAI) [2], which aims to explain why a given model makes particular decisions and to identify potential biases and errors.

2 Background

2.1 Peer Assessment and Peer Feedback

Peer assessment is a process in which students evaluate and provide feedback on each other’s performance [13]. It is an educational strategy in which students assess the quality of other students’ work and provide helpful feedback to improve their learning process [42]. In a participatory learning context, peer assessment is critical in providing individuals or groups with valuable information to enhance their performance [26, 35]. Furthermore, peer assessment is commonly suggested to increase students’ engagement with their own learning [18].
Peer assessment and PF are interrelated processes in the collaborative learning context [27]. As the outcome of the peer assessment process, in which a student evaluates the performance of another student (i.e., the peer), PF can be offered to the student whose work has been evaluated [43]. Receiving PF allows students to reflect on their own work and make more adjustments and improvements [13]. PF has been demonstrated to improve student learning and academic performance (e.g.,[30, 40]). Simonsmeier et al. [40] stress that PF is an effective method for improving students’ academic self-concept in the setting of academic writing. In that context, Gielen and De Wever [18] reviewed 109 papers on peer assessment, where the results reveal that the practice offers benefits for both assessors and those being assessed in multiple ways. These benefits include promoting constructive reflection, increased time on task, attention to important parts of quality work, and a greater sense of responsibility.
One crucial aspect of effective PF is its nature, understood in terms of the specificity and the meaning of the information in the feedback. Huang et al. [21] showed that the quality, or nature, of PF affects the level of learning improvement. Scholars have examined the nature of PF before, with varying results. Ion et al. [23] studied the nature of PF using semantic analysis and identified three types of feedback: (1) task-oriented feedback focused on motivation, (2) feedback related to emotional aspects, and (3) feedback concerning the structural aspects of the activity. Chien et al. [14] classified PF into four categories: praise, opinion, criticism, and irrelevance. Further, scholars have classified PF focusing on its affective and cognitive nature (e.g., [17]). All the above-mentioned studies relied on manual coding and demonstrated varying results. To gain more insights into the nature of PF, there is a need to investigate how large amounts of text data from PF activities can be analyzed to further unravel the nature of PF and to provide in-time support for struggling students in CSCL settings. In this study, we implemented a bottom-up approach in which we explore the affordances of machine learning to analyze a large amount of text data (n = 231 students; for more, see Method) from PF activities in the CSCL setting of engineering higher education.

2.2 Classification of Feedback Messages

Previous works have used NLP to classify feedback messages. For instance, Cavalcanti et al. [7] conducted a content analysis of instructor feedback. Their study focused on assessing the quality of instructor feedback extracted from assessments in Brazilian Portuguese. Using a random forest classifier, the authors achieved an accuracy of 0.75 and a Cohen’s kappa of 0.20. Similarly, Osakwe et al. [34] applied the XGBoost classifier to a dataset of feedback texts in English. Their model achieved accuracy values of 0.87, 0.82, and 0.69 for the self, task, and process feedback levels [19], respectively. Additionally, the study investigated the most important textual features associated with each feedback level. In another study, Ruiz Alonso et al. [38] proposed a multi-label classification approach for feedback levels [19], incorporating hyperparameter tuning to enhance model performance. The authors experimented with support vector machine, random forest, and k-nearest neighbors classifiers. In the best case, the support vector machine algorithm achieved an F1-score of 0.871.
Regarding the analysis of PF messages, Castro et al. [5] compared SVM, KNN, Logistic Regression, Naive Bayes, Random Forest, AdaBoost, XGBoost, and CRF models for this task. Three methods for representing features in these models were compared: content-based features (TF-IDF), content-independent features, and sequential features. The results demonstrated that using content-based features with TF-IDF led to the best performance, with the top-performing model, CRF, reaching a Cohen’s κ of 0.43.
Recently, Cavalcanti et al. [10] explored applying the BERT transformer model to classify instructor feedback based on established educational frameworks. The proposed method significantly improved over previous approaches, with BERT outperforming traditional machine learning models by up to 35.71%, reaching up to 0.76 in Cohen’s kappa. Also, the authors integrated XAI techniques to enhance the interpretability of the model’s predictions, indicating the most influential features in the classification process. The gains obtained by Cavalcanti et al. [10] demonstrate the potential of using BERT in feedback classification, as it outperformed traditional models. Moreover, integrating XAI ensures transparency in these predictions, giving educators more precise insights into the factors influencing feedback quality.
Finally, Hutt et al. [22] examined both classical NLP techniques and ChatGPT to devise an automated detector of peer feedback quality in K-12 settings. Classical NLP detectors, which combine simple NLP approaches with supervised machine learning, were found to be more accurate in this task than ChatGPT.

2.3 Language Models and Explainable AI

The literature highlights that BERT models have produced significant results in the education domain [10, 20]. In short, BERT is a deeply bidirectional language model designed to capture contextual representations in language by training on massive datasets, such as the Books Corpus and Wikipedia [15]. This extensive pre-training enables BERT to generate sophisticated contextual embeddings, making it highly effective for various NLP tasks.
Besides the original architecture, the BERT model also has several variants. RoBERTa enhances the original BERT model, addressing limitations in BERT’s training by employing dynamic masking, which exposes the model to varied masking patterns within the same sequence [31]. Unlike BERT, which uses static masking and masks each sequence once, RoBERTa’s dynamic approach improves training diversity. Similarly, DistilBERT, another variant of BERT, maintains the same architecture but with 50% fewer layers and 40% fewer parameters [39], aiming to offer a faster, more efficient version of BERT while retaining 97% of its performance.
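To make the size trade-off concrete, the following sketch (ours, not taken from the cited works) loads the standard public checkpoints with the Hugging Face Transformers library and counts their parameters; the checkpoint names are assumptions, since the specific pretrained weights are not discussed here.

from transformers import AutoModel

# Compare the parameter counts of the three encoder variants discussed above.
for checkpoint in ["bert-base-uncased", "roberta-base", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {n_params / 1e6:.1f}M parameters")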
Recently, GPT models have been widely recognized in academic research for their ability to address various challenges [41, 48]. For instance, studies have demonstrated the effectiveness of using GPT models, even in zero-shot scenarios, to assess essays [4] and short responses [11] automatically. Moreover, recent studies show the potential of GPT for supporting the feedback process [47].
Language models have demonstrated promising results in educational settings. Yet, a significant challenge for their practical adoption is the lack of transparency in how they generate predictions, making it difficult to understand or trust the results of these algorithms [2]. The field of Explainable Artificial Intelligence (XAI) addresses this issue by providing insights into the decision-making processes of these models, ultimately enhancing trust and usability.
An example of an XAI method is the Local Interpretable Model-Agnostic Explanations (LIME), which highlights parts of the input that contribute to a model’s predictions [45]. The primary aim of LIME is "to identify an interpretable model over the interpretable representation that is locally faithful to the classifier" [36]. Rather than attempting to create a globally interpretable model, LIME trains local models to explain individual predictions [1]. These interpretable models are designed to offer clear insights while approximating the underlying model’s behavior as accurately as possible, offering interpretability to the original predictions [1].

2.4 Research Questions

Based on the increasing use of PF in educational settings, particularly in the CSCL setting of large courses [24], there is a growing need to automate the analysis of large datasets generated by students. Earlier research has focused mainly on analyzing instructor feedback, but the classification of student PF needs further investigation. This study aims to fill this gap by evaluating the performance of various NLP models in identifying the nature of PF messages. Furthermore, we investigate the role of XAI in providing insights into the predictions made by the top-performing model. These goals are reflected in the following research questions:
RQ1: To what extent can NLP models accurately identify the nature of peer feedback messages based on text data?
RQ2: To what extent can XAI deepen our understanding of the decision-making processes of the best performing NLP model?

3 Method

This research has been approved by an internal institutional ethical committee (Nr: V-2021-0392).

3.1 Dataset

The dataset used in this study consists of PF messages collected from a large project-based university course in Europe. It involves data from 231 students who worked collaboratively on their projects in groups of four to six students. In total, the dataset contains 2,444 feedback messages written in English, which were divided into 10,319 sentences. These sentences were derived from peer assessments that were part of both formative and summative evaluations. Overall, each student provided anonymous feedback to four to five peers in their group, amounting to between 8 and 12 feedback texts throughout the course. These messages were then reviewed by the course teaching assistants and the course examiner to assess individual student contributions. The CLASS system enabled all these PF interactions (see [6]).
Once collected, the data was cleaned and annotated using an exploratory approach based on the Boyatzis framework [3]. This approach allowed for feedback categories to emerge naturally from the text data based on the annotation process that involved four main steps. First, two annotators analyzed a small subset of the data (3%) and identified six categories: management, suggestions for improvement, interpersonal factors, cognition, affect, and miscellaneous, which align with established taxonomies in CSCL and feedback research [5, 43]. Second, a sample of 500 feedback sentences was coded separately by two annotators, which achieved a moderate level of agreement (Cohen’s kappa = 0.65). Third, discrepancies were discussed and re-annotated, with a final Cohen’s kappa value of 0.86, indicating a strong agreement among raters. Lastly, the remaining dataset was coded by the annotators independently. Table 1 presents examples from this dataset along with a description of each feedback class and the number of samples per class.
Table 1: Samples from the dataset [6].
Category | Description | Example
Management (n = 1932) | All contributions to the group and the student’s tasks during the project. No emotional opinion. All task-related behaviors. | You also handled your interview nicely, asking good follow up questions to clear up any potential misunderstandings.
Affect (n = 1612) | Effect of the student on the group or on the feedback-giving person. | You have contributed actively to the design process.
Interpersonal factors (n = 207) | All interaction, both positive and negative, and the process of working in a group. | Is always willing to collaborate with the rest of the group and participate in the discussions.
Suggestions for improvement (n = 231) | Suggestions on how to improve on a certain domain. | His positive attitude towards all ideas means a lot to many of our group members, as group work can usually generate a lot of negativity.
Cognition (n = 546) | Feedback related to thinking and inspiration. | You work well in a group and often provide good points during discussions.
Miscellaneous (n = 240) | Interesting feedback that does not fit the other categories. | Don’t think much has changed since the previous peer feedback, and therefore I don’t have much to add regarding your performance.

3.2 Model Selection and Evaluation

Six models have been implemented: two traditional and four transformer-based, including GPT4o. For developing both traditional and transformer models, this study utilized an NVIDIA P100 GPU with 16 GB of RAM, where all experiments were performed on the Kaggle platform.
The traditional models used in this study were MLP and Decision Tree, both implemented using the Python library scikit-learn with default parameter settings. These models were selected for their effectiveness in previous text classification studies [9, 16]. To represent the input data, we employed TF-IDF (Term Frequency-Inverse Document Frequency) [32] as the feature extraction method, which is commonly used in text analysis to weigh the importance of words within the dataset [16].
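As an illustration, the sketch below (ours, not the study’s exact code) builds the TF-IDF plus classifier pipelines with scikit-learn defaults; the toy sentences are placeholders, and stop_words="english" marks the stop-word-removal variant mentioned in the abstract.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy placeholders; the actual dataset is described in Section 3.1.
texts = ["You also handled your interview nicely",
         "His positive attitude means a lot",
         "You work well in a group",
         "Try to share your ideas earlier"]
labels = ["Management", "Affect", "Cognition", "Suggestions for improvement"]

for name, clf in [("Decision Tree", DecisionTreeClassifier()),
                  ("MLP", MLPClassifier())]:
    # stop_words="english" removes stop words; omit the argument to keep them.
    pipeline = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                         ("clf", clf)])
    pipeline.fit(texts, labels)
    print(name, pipeline.predict(["Asking good follow up questions helped the group"]))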
The transformer-based models utilized in this study were BERT, RoBERTa, and DistilBERT, all implemented using the SimpleTransformers library, which is built upon the widely adopted Hugging Face Transformers framework. These models were configured with their default parameter settings, ensuring consistency with standard practices in transformer-based natural language processing tasks while leveraging their robust pre-trained architectures for effective text classification.
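The following is a minimal sketch of fine-tuning RoBERTa with SimpleTransformers under default settings; the DataFrame contents and the integer label encoding are illustrative assumptions, not the study’s actual data.

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Illustrative data: "labels" holds integer ids for the six feedback categories.
train_df = pd.DataFrame({
    "text": ["You also handled your interview nicely",
             "Try to share your ideas earlier"],
    "labels": [0, 3],
})

# "roberta"/"roberta-base" with default arguments; set use_cuda=True on a GPU such as the P100.
model = ClassificationModel("roberta", "roberta-base", num_labels=6, use_cuda=False)
model.train_model(train_df)
predictions, raw_outputs = model.predict(
    ["Is always willing to collaborate with the rest of the group"])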
We also integrated a GPT model to explore the potential of large-scale language models in this context. We adopted the gpt-4o-2024-08-06 model through the OpenAI API, which was the most advanced version available at the time. It is well known that prompt design plays a key role in the model’s capacity to generate accurate and relevant results [46]. In our study, the prompts were carefully tailored to the specific context of the task, incorporating clear and direct instructions and specifying the desired output format [33]. Table 2 presents the final prompt used in this study; a sketch of how such a prompt can be submitted through the API follows the table.
Table 2: Prompt for GPT-4o
Element | Text
Context | Peer feedback written by higher education students in the engineering education setting.
Instruction | Determine which category of peer feedback is predominant in the following. You should consider the following categories: Management, Suggestions for improvement, Interpersonal factors, Cognition, Affect, and Miscellaneous.
Output format | Indicate only one category in the response.
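As an illustration, the sketch below shows one way the Table 2 prompt could be sent to the gpt-4o-2024-08-06 model through the OpenAI Python client; the exact request structure used in the study is not reported, so the message layout and the classify_feedback helper are our assumptions.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_feedback(sentence: str) -> str:
    # Prompt assembled from the Context, Instruction, and Output format rows of Table 2.
    prompt = (
        "Peer feedback written by higher education students in the engineering education setting.\n"
        "Determine which category of peer feedback is predominant in the following. "
        "You should consider the following categories: Management, Suggestions for improvement, "
        "Interpersonal factors, Cognition, Affect, and Miscellaneous.\n"
        "Indicate only one category in the response.\n\n"
        f"Feedback: {sentence}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(classify_feedback("You also handled your interview nicely."))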
To answer the first research question, we split the dataset into a training set (80%) and a test set (20%). To assess the models’ performance, we employed two widely recognized metrics, accuracy and Cohen’s kappa, which are commonly used in similar studies to compare classification effectiveness in the educational domain [16].
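A minimal sketch of this evaluation setup follows; texts and labels stand in for the full sentence-level dataset, model for any of the classifiers sketched above (e.g., a scikit-learn pipeline), and random_state is an illustrative choice.

from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# 80/20 split as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))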
To answer the second research question, we used the LIME library, accessible through Python, to implement LIME for interpretability. This study applied LIME to the top-performing model to provide insights into its decision-making process. A subset of the dataset was selected, allowing LIME to explain the model’s predictions by highlighting the specific words that influenced the classification. LIME assigns probabilities to these words, demonstrating their contribution to determining whether a sentence belongs to a particular class, thus offering a clearer understanding of the factors driving the model’s predictions.
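The sketch below illustrates how LIME can be applied to a fine-tuned classifier of this kind; the predict_proba wrapper around a SimpleTransformers model and the example sentence are our assumptions, not the study’s exact code.

import numpy as np
from lime.lime_text import LimeTextExplainer

class_names = ["Management", "Affect", "Interpersonal factors",
               "Suggestions for improvement", "Cognition", "Miscellaneous"]
explainer = LimeTextExplainer(class_names=class_names)

def predict_proba(sentences):
    # Softmax over the raw outputs of a fine-tuned SimpleTransformers model (assumption).
    _, raw_outputs = model.predict(list(sentences))
    exp_scores = np.exp(raw_outputs - raw_outputs.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

explanation = explainer.explain_instance(
    "Don't be afraid to be open with someone in the group.",
    predict_proba, num_features=6, top_labels=1)
label = explanation.available_labels()[0]
print(class_names[label], explanation.as_list(label=label))  # (word, weight) pairs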

4 Results

4.1 RQ1: Machine Learning Classification of PF messages

Table 3 presents the performance of the models evaluated in this study. The results reveal that RoBERTa had the highest performance across all metrics, with an accuracy of 73.1% and a Cohen’s κ of 0.61, which is considered substantial agreement with the ground truth [28]. All the other BERT variants also outperformed the traditional methods. Moreover, it is important to highlight that although GPT-4o did not achieve performance comparable to the BERT variants, it used no training data in its classification process.
Table 3: Results for the analyzed models
Model | Accuracy | κ
Decision Tree | 54.7 | 0.38
MLP | 63.7 | 0.48
BERT | 70.3 | 0.60
RoBERTa | 73.1 | 0.61
DistilBERT | 69.6 | 0.59
GPT-4o | 62.3 | 0.49

4.2 RQ2: XAI to Understand Feedback Classifications

Given RoBERTa’s highest performance, LIME was used to gain insights into the reasoning behind the model’s choices for selected contexts from the dataset. We present two examples from different feedback classes, in which words shaded in a darker color (purple and blue) indicate a higher probability that the feedback belongs to the considered class, while lighter shading indicates a lower probability.
Figure 1a shows an example of feedback from the category ’Cognition’. Words like “afraid”, “someone”, and “open” have the highest probability of explaining the model’s decision. Figure 1b presents an example of a text from the category ’Miscellaneous’. In this case, "could" and "improve" were the most predictive words.
Figure 1: Highlighted words and important features of the examples.

5 Discussion

This study aimed at improving our understanding of how the classification of student PF text data could be automated (through machine learning) to scale up related learning analytics interventions and support (e.g., adaptive scaffolding of students) in CSCL settings. We examined the performance of six machine learning models in classifying PF text messages from students in a large CSCL course (n = 231 students) in higher education. We extend previous work that used NLP to automatically classify student PF text data, with a focus on the student, to better understand the nature of their feedback to peers. We found that the transformer-based models (BERT, RoBERTa, and DistilBERT) outperformed the traditional models (MLP and Decision Tree) as well as GPT-4o. The latter finding is in line with recent results by Hutt et al. [22], who similarly showed that ChatGPT is outperformed by classical NLP techniques in the automated detection of (peer) feedback quality. In future work, different prompting strategies, which have been found to be predictive of the quality of LLM-generated outputs [25], and other LLMs should be examined in similar settings and beyond.
The results of the present study build on the findings of [22] by presenting another effective way of using NLP to classify student PF text data. The traditional models probably struggled to match the transformer-based models because they lack pre-training. RoBERTa achieved an accuracy of 73.1% and a kappa of 0.61, indicating substantial agreement [29]. In that sense, RoBERTa’s performance can be regarded as reliable, given that the study by Castro et al. [6] achieved at best an F1-score of 0.58 and a Cohen’s κ of 0.43 with CRF. Moreover, it aligns with previous work on classifying feedback messages, where BERT variants also reached the best results [10]. This finding has the potential to scale up the automated coding of student PF text data for improved learner and teaching support based on LA interventions. However, before scaling this up, we call for replication studies with datasets from similar educational contexts. Furthermore, the application of RoBERTa and other transformer-based models to datasets in languages other than English should be further explored [16].
Further, this study aimed at improving our understanding of the role of XAI, namely LIME [45], in understanding the decision-making processes of the best-performing model. LIME was applied to RoBERTa’s predictions for two contexts selected from the dataset. The most predictive words highlighted by XAI are aligned with the theory behind the classes we were evaluating [44].
A practical implication of these findings is that RoBERTa has the potential to accurately identify the nature of PF text data, providing useful insights for educators and researchers in the field. This model can therefore aid examiners, teachers, and instructors in analyzing and identifying the nature of PF and, potentially, its relation to performance, to support future learning processes within that learning environment. A specific implication for using NLP models to analyze PF text data is the need to consider whether stop words are present in the text data. By using XAI to analyze the selected contexts, we found that specific stop words are a crucial factor in RoBERTa’s decision making.
For instance, one might build on our findings to design a human-in-the-loop approach that could enhance the interpretability and reliability of the model’s outputs. Involving educators or domain specialists in the data analysis process would allow for iterative refinement of the model’s outputs, ensuring that its insights align with stakeholders’ needs and educational contexts. Additionally, this approach could help fine-tune the model by integrating automated analysis and human expertise. As a result, implementing such a human-in-the-loop strategy could lead to a deeper understanding of the relationship between language patterns in PF text messages and educational outcomes, ultimately contributing to more personalized and effective learning interventions and improved learning outcomes.
In that context, LIME is well suited to foster transparency and trust. Because LIME makes RoBERTa’s individual predictions understandable by highlighting the components of the text that most influence the model’s outcomes, it helps build trust with users who rely on these insights to make informed decisions. By highlighting how specific features, such as stop words, impact the analysis, LIME supports educators in validating and refining the model’s predictions. Consequently, this would enhance the credibility of automated feedback systems and encourage their integration into educational practices.
This paper has several limitations. First, the dataset used in this study was imbalanced, which means the models might favor the classes with the most instances. There are several methods to deal with data imbalance, such as undersampling the classes with the most data and oversampling the classes with the least data so that all classes have the same number of labels. Additionally, the default hyperparameters were used for all models; the models’ performance could be improved if the hyperparameters were tuned using a variety of methods [37]. Furthermore, the GPT results could also be improved by incorporating additional elements, such as examples (few-shot prompts). Lastly, the data was split so that 80% was used for training and 20% for testing. The proportions of training and testing data can be changed; however, the models’ performance may then vary. Methods such as cross-validation, in which different subsets of the data are used for training and testing, can therefore be applied. Similarly, the transformer-based models could be further investigated in terms of different architectures and training procedures. Finally, finding a model that has already been pre-trained on peer assessment feedback or similar data could be valuable in understanding how applicable it is to the dataset used in this study and similar ones.
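As an illustration of two of these remedies, the sketch below (ours, not part of the study) combines class weighting with 5-fold cross-validation scored by Cohen’s kappa; texts and labels again stand in for the full dataset, and the fold count and class_weight="balanced" setting are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", DecisionTreeClassifier(class_weight="balanced")),  # weight classes to counter imbalance
])
# Evaluate on 5 different train/test partitions instead of a single 80/20 split.
scores = cross_val_score(pipeline, texts, labels, cv=5,
                         scoring=make_scorer(cohen_kappa_score))
print("Mean Cohen's kappa across folds:", scores.mean())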

6 Conclusion

The strong performance of transformer-based models, particularly RoBERTa, suggests a shift in how natural language processing tasks such as peer feedback classification can be approached in CSCL educational settings. By outperforming traditional models and prior approaches, this study demonstrates the potential of transformers to enhance learning analytics and AI-informed applications in education, particularly when analyzing nuanced, unstructured text data such as that used in this study. This could pave the way for more accurate automated feedback systems, better peer interaction analysis, and improved instructional design in collaborative learning environments. Overall, the use of transformer models, especially RoBERTa, along with explainable AI, to classify peer feedback in CSCL settings not only surpasses traditional models but also sets a new benchmark in terms of accuracy and reliability, marking a notable improvement in educational natural language processing applications.

Acknowledgments

This study was partially supported by Digital Futures (Stockholm, Sweden) and by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (310888/2021-2). Finally, we would like to thank the OpenAI research program for providing the credits required for this experiment, and all the study participants.

References

[1]
Jaweriah Alvi. 2021. Explainable Multimodal Fusion.
[2]
Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information fusion 58 (2020), 82–115.
[3]
Richard E Boyatzis. 1998. Transforming qualitative information: Thematic analysis and code development. Sage.
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[5]
Mayara Simões de Oliveira Castro, Rafael Ferreira Mello, Giuseppe Fiorentino, Olga Viberg, Daniel Spikol, Martine Baars, and Dragan Gašević. 2023. Understanding peer feedback contributions using natural language processing. (2023), 399–414.
[6]
Mayara Simões de Oliveira Castro, Rafael Ferreira Mello, Giuseppe Fiorentino, Olga Viberg, Daniel Spikol, Martine Baars, and Dragan Gašević. 2023. Understanding peer feedback contributions using natural language processing. (2023), 399–414.
[7]
Anderson Pinheiro Cavalcanti, Rafael Ferreira Leite de Mello, Vitor Rolim, Máverick André, Fred Freitas, and Dragan Gaševic. 2019. An Analysis of the use of Good Feedback Practices in Online Learning Courses. In 2019 IEEE 19th International Conference on Advanced Learning Technologies (ICALT), Vol. 2161. IEEE, 153–157.
[8]
Anderson Pinheiro Cavalcanti, Arthur Diego, Rafael Ferreira Mello, Katerina Mangaroska, André Nascimento, Fred Freitas, and Dragan Gašević. 2020. How good is my feedback? a content analysis of written feedback. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge. 428–437.
[9]
Anderson Pinheiro Cavalcanti, Arthur Diego, Rafael Ferreira Mello, Katerina Mangaroska, André Nascimento, Fred Freitas, and Dragan Gaševic. 2020. How good is my feedback? A Content Analysis of Written Feedback. In Proceedings of the 10th International Conference on Learning Analytics and Knowledge - LAK. ACM.
[10]
Anderson Pinheiro Cavalcanti, Rafael Ferreira Mello, Dragan Gašević, and Fred Freitas. 2023. Towards explainable prediction feedback messages using BERT. International Journal of Artificial Intelligence in Education (2023), 1–26.
[11]
Imran Chamieh, Torsten Zesch, and Klaus Giebermann. 2024. LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 309–315.
[12]
Juanjuan Chen, Minhong Wang, Paul A Kirschner, and Chin-Chung Tsai. 2018. The role of collaboration, computer use, learning environments, and supporting strategies in CSCL: A meta-analysis. Review of Educational Research 88, 6 (2018), 799–843.
[13]
Juanjuan Chen, Minhong Wang, Paul A Kirschner, and Chin-Chung Tsai. 2018. The role of collaboration, computer use, learning environments, and supporting strategies in CSCL: A meta-analysis. Review of Educational Research 88, 6 (2018), 799–843.
[14]
Shu-Yun Chien, Gwo-Jen Hwang, and Morris Siu-Yung Jong. 2020. Effects of peer assessment within the context of spherical video-based virtual reality on EFL students’ English-Speaking performance and learning perceptions. Computers & Education 146 (2020), 103751.
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[16]
Rafael Ferreira-Mello, Máverick André, Anderson Pinheiro, Evandro Costa, and Cristobal Romero. 2019. Text mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9, 6 (2019), e1332.
[17]
Soon Yen Foo. 2021. Analysing peer feedback in asynchronous online discussions: A case study. Education and Information Technologies 26, 4 (2021), 4553–4572.
[18]
Mario Gielen and Bram De Wever. 2015. Structuring the peer assessment process: A multilevel approach for the impact on product improvement and peer feedback quality. Journal of Computer Assisted Learning 31, 5 (2015), 435–449.
[19]
John Hattie and Helen Timperley. 2007. The power of feedback. Review of educational research 77, 1 (2007), 81–112.
[20]
Yuanyuan Hu, Rafael Ferreira Mello, and Dragan Gašević. 2021. Automatic analysis of cognitive presence in online discussions: An approach using deep learning and explainable artificial intelligence. Computers and Education: Artificial Intelligence 2 (2021), 100037.
[21]
Changqin Huang, Yaxin Tu, Zhongmei Han, Fan Jiang, Fei Wu, and Yunliang Jiang. 2023. Examining the relationship between peer feedback classified by deep learning and online learning burnout. Computers & Education 207 (2023), 104910.
[22]
Stephen Hutt, Allison DePiro, Joann Wang, Sam Rhodes, Ryan S Baker, Grayson Hieb, Sheela Sethuraman, Jaclyn Ocumpaugh, and Caitlin Mills. 2024. Feedback on Feedback: Comparing Classic Natural Language Processing and Generative AI to Evaluate Peer Feedback. In Proceedings of the 14th Learning Analytics and Knowledge Conference. 55–65.
[23]
Georgeta Ion, Aleix Barrera-Corominas, and Marina Tomàs-Folch. 2016. Written peer-feedback to enhance students’ current and future learning. International Journal of Educational Technology in Higher Education 13 (2016), 1–11.
[24]
Nafiseh Taghizadeh Kerman, Seyyed Kazem Banihashem, Mortaza Karami, Erkan Er, Stan Van Ginkel, and Omid Noroozi. 2024. Online peer feedback in higher education: A synthesis of the literature. Education and Information Technologies 29, 1 (2024), 763–813.
[25]
Nils Knoth, Antonia Tolzin, Andreas Janson, and Jan Marco Leimeister. 2024. AI literacy and its implications for prompt engineering strategies. Computers and Education: Artificial Intelligence 6 (2024), 100225.
[26]
Ingo Kollar and Frank Fischer. 2010. Peer assessment as collaborative learning: a cognitive perspective. Learning and Instruction 20(4) (2010), 344–348. https://hal.science/hal-00703943
[27]
Ingo Kollar and Frank Fischer. 2010. Peer assessment as collaborative learning: A cognitive perspective. Learning and instruction 20, 4 (2010), 344–348.
[28]
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174.
[29]
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174.
[30]
Hongli Li, Yao Xiong, Charles Vincent Hunter, Xiuyan Guo, and Rurik Tywoniw. 2020. Does peer assessment promote student learning? A meta-analysis. Assessment & Evaluation in Higher Education 45, 2 (2020), 193–211.
[31]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[32]
Christopher Manning and Hinrich Schutze. 1999. Foundations of statistical natural language processing. MIT press.
[33]
Rafael Ferreira Mello, Luiz Rodrigues, Erverson Sousa, Hyan Batista, Mateus Lins, Andre Nascimento, and Dragan Gasevic. 2024. Automatic Detection of Narrative Rhetorical Categories and Elements on Middle School Written Essays. In International Conference on Artificial Intelligence in Education. Springer, 295–308.
[34]
Ikenna Osakwe, Guanliang Chen, Alex Whitelock-Wainwright, Dragan Gašević, Anderson Pinheiro Cavalcanti, and Rafael Ferreira Mello. 2022. Towards automated content analysis of educational feedback: A multi-language study. Computers and Education: Artificial Intelligence 3 (2022), 100059.
[35]
Chris Phielix, Frans J Prins, and Paul A Kirschner. 2010. Awareness of group performance in a CSCL-environment: Effects of peer feedback and reflection. Computers in Human Behavior 26, 2 (2010), 151–161.
[36]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386 (2016).
[37]
Christine Rosquist. 2021. Text Classification of Human Resources-related Data with Machine Learning.
[38]
Dorian Ruiz Alonso, Claudia Zepeda Cortés, Hilda Castillo Zacatelco, and José Luis Carballido Carranza. 2022. Hyperparameter tuning for multi-label classification of feedbacks in online courses. Journal of Intelligent & Fuzzy Systems, Preprint (2022), 1–9.
[39]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[40]
Bianca A Simonsmeier, Henrike Peiffer, Maja Flaig, and Michael Schneider. 2020. Peer feedback improves students’ academic self-concept in higher education. Research in Higher Education 61 (2020), 706–724.
[41]
Yan Tao, Olga Viberg, Ryan S Baker, and René F Kizilcec. 2024. Cultural bias and cultural alignment of large language models. PNAS nexus 3, 9 (2024), pgae346.
[42]
Ineke Van den Berg, Wilfried Admiraal, and Albert Pilot. 2006. Designing student peer assessment in higher education: Analysis of written and oral peer feedback. Teaching in higher education 11, 2 (2006), 135–147.
[43]
Olga Viberg, Martine Baars, Rafael Ferreira Mello, Niels Weerheim, Daniel Spikol, Cristian Bogdan, Dragan Gasevic, and Fred Paas. 2024. Exploring the nature of peer feedback: An epistemic network analysis approach. Journal of Computer Assisted Learning (2024).
[44]
Olga Viberg, Anna Mavroudi, Ylva Fernaeus, Cristian Bogdan, and Jarmo Laaksolahti. 2020. Reducing free riding: CLASS–a system for collaborative learning assessment. In Methodologies and Intelligent Systems for Technology Enhanced Learning, 9th International Conference, Workshops. Springer, 132–138.
[45]
Klaus Virtanen. 2022. Using XAI Tools to Detect Harmful Bias in ML Models.
[46]
Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023).
[47]
Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martinez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao Jin, and Dragan Gašević. [n.d.]. Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology n/a, n/a ([n. d.]).
[48]
Zhuang Ziyu, Chen Qiguang, Ma Longxuan, Li Mingda, Han Yi, Qian Yushan, Bai Haopeng, Zhang Weinan, and Ting Liu. 2023. Through the Lens of Core Competency: Survey on Evaluation of Large Language Models. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum), Jiajun Zhang (Ed.). Chinese Information Processing Society of China, Harbin, China, 88–109. https://aclanthology.org/2023.ccl-2.8

Published In: LAK '25: Proceedings of the 15th International Learning Analytics and Knowledge Conference, March 2025, 1018 pages. ISBN: 9798400707018. DOI: 10.1145/3706468.
Publisher: Association for Computing Machinery, New York, NY, United States.
Published: 03 March 2025.
Author Tags: Peer feedback; Higher Education; Machine Learning; Explainable artificial intelligence; Computer Supported Collaborative Learning.