DOI: 10.1145/3706468.3706526 · LAK Conference Proceedings · Short paper · Open access

That's What RoBERTa Said: Explainable Classification of Peer Feedback

Published: 03 March 2025

Abstract

Peer feedback (PF) is essential for improving student learning outcomes, particularly in Computer-Supported Collaborative Learning (CSCL) settings. When digital tools are used for PF practices, student data (e.g., PF text entries) is generated automatically. Analyzing these large datasets can enhance our understanding of how students learn and help improve their learning. However, manually processing such large datasets is time-intensive, highlighting the need for automation. This study investigates the use of six machine learning models to classify PF messages from 231 students in a large university course. The models include Multi-Layer Perceptron (MLP), Decision Tree, BERT, RoBERTa, DistilBERT, and GPT-4o. The models were evaluated based on accuracy and Cohen’s kappa. Preprocessing involved removing stop words, and the impact of this step on model performance was assessed. Results showed that only the Decision Tree model improved with stop-word removal, while performance decreased for the other models. RoBERTa consistently outperformed the others across all metrics. Explainable AI was used to understand RoBERTa’s decisions by identifying the most predictive words. This study contributes to the automatic classification of peer feedback, which is crucial for scaling learning analytics efforts that aim to provide better in-time support to students in CSCL settings.

1 Introduction

Peer feedback (PF) is a key component of online education [21], especially in Computer-Supported Collaborative Learning (CSCL) settings [12], and is predictive of learning performance in educational settings [30]. When working in CSCL settings, students generate large volumes of PF data (largely in the form of text entries) that could be used not only to improve our understanding of their learning processes but also to provide improved in-time support to the learner. This is important since the specificity, or quality, of PF affects the level of learning improvement [21]. Such quality can be improved by instructional support based on enhanced insights into the nature of students’ PF. In earlier research exploring the nature of PF, scholars have largely classified student PF data using qualitative coding [17, 43], a very time-consuming task that is not feasible for many stakeholders, including researchers and teachers.
Because PF activities are increasingly practiced in large courses (with > 100 students [43]), the analysis of student PF data needs to be automated (e.g., [22]) to better support teachers, instructors, and examiners in identifying areas in which learners could improve throughout the course and after it, and, ultimately, to provide support that is conducive to students’ learning. Natural Language Processing (NLP) techniques could address this gap, as NLP has previously been effective in identifying patterns in large datasets in educational contexts [16, 22].
Previous work has analyzed instructor feedback [8, 10]. However, it is unclear which NLP model shows the best results when processing and interpreting PF data [21]. Specifically, grasping the nature of PF messages to understand what the feedback was aimed at (i.e., its meaning) can be challenging. Therefore, more research is needed to better understand how NLP models can identify the nature of student PF in text entries in CSCL settings [12].
To address the above-mentioned gap, this study investigates the effectiveness of six selected NLP models in automating the classification of student PF. In particular, we examine how accurately these models identify the different aspects of the nature of students’ feedback in peer evaluation. For this, we compare two traditional algorithms (Decision Tree and Multi-Layer Perceptron, MLP), three transformer-based models (BERT, RoBERTa, and DistilBERT), which have demonstrated state-of-the-art performance in several tasks [10], and GPT-4o, a multimodal large language model [48]. Moreover, to gain a better understanding of how the NLP model makes its decisions, we implement Explainable Artificial Intelligence (XAI) [2], which aims to explain why a given model makes particular decisions and to identify potential biases and errors.

2 Background

2.1 Peer Assessment and Peer Feedback

Peer assessment is a process in which students evaluate and provide feedback on each other’s performance [13]. It is an educational strategy in which students assess the quality of other students’ work and provide helpful feedback to improve their learning process [42]. In a participatory learning context, peer assessment is critical in providing individuals or groups with valuable information to enhance their performance [26, 35]. Furthermore, peer assessment is commonly suggested to increase students’ engagement with their own learning [18].
Peer assessment and PF are interrelated processes in the collaborative learning context [27]. As the outcome of the peer assessment process, in which a student evaluates the performance of another student (i.e., the peer), PF can be offered to the student whose work has been evaluated [43]. Receiving PF allows students to reflect on their own work and make more adjustments and improvements [13]. PF has been demonstrated to improve student learning and academic performance (e.g.,[30, 40]). Simonsmeier et al. [40] stress that PF is an effective method for improving students’ academic self-concept in the setting of academic writing. In that context, Gielen and De Wever [18] reviewed 109 papers on peer assessment, where the results reveal that the practice offers benefits for both assessors and those being assessed in multiple ways. These benefits include promoting constructive reflection, increased time on task, attention to important parts of quality work, and a greater sense of responsibility.
One crucial aspect of effective PF is its nature, understood in terms of the specificity and the meaning of the information in the feedback. Huang et al. [21] showed that the quality, or nature, of PF affects the level of learning improvement. Scholars have examined the nature of PF before, with varying results. Ion et al. [23] studied the nature of PF using semantic analysis and identified three types of feedback: (1) task-oriented feedback focused on motivation, (2) feedback related to emotional aspects, and (3) feedback concerning the structural aspects of the activity. Chien et al. [14] classified PF into four categories: praise, opinion, criticism, and irrelevance. Further, scholars have classified PF focusing on its affective and cognitive nature (e.g., [17]). All the above-mentioned studies relied on manual coding and demonstrated varying results. To gain more insights into the nature of PF, there is a need to investigate how large amounts of text data from PF activities can be analyzed to further unravel the nature of PF and to provide in-time support for struggling students in CSCL settings. In this study, we implemented a bottom-up approach in which we explore the affordances of machine learning to analyze a large amount of text data (n = 231 students; for more, see Method) from PF activities in the CSCL setting of engineering higher education.

2.2 Classification of Feedback Messages

Previous works have used NLP to classify feedback messages. For instance, Cavalcanti et al. [7] conducted a content analysis of instructor feedback. Their study focused on assessing the quality of instructor feedback extracted from assessments in Brazilian Portuguese. Using a random forest classifier, the authors achieved an accuracy of 0.75 and a Cohen’s kappa of 0.20. Similarly, Osakwe et al. [34] applied the XGBoost classifier to a dataset of feedback texts in English. Their model achieved accuracy values of 0.87, 0.82, and 0.69 for the self, task, and process feedback levels [19], respectively. Additionally, the study investigated the most important textual features associated with each feedback level. In another study, Ruiz Alonso et al. [38] proposed a multi-label classification approach for feedback levels [19], incorporating hyperparameter tuning to enhance model performance. The authors experimented with support vector machine, random forest, and k-nearest neighbors classifiers. In the best case, the support vector machine algorithm achieved an F1-score of 0.871.
Regarding the analysis of PF messages, Castro et al. [5] compared SVM, KNN, Logistic Regression, Naive Bayes, Random Forest, AdaBoost, XGBoost, and CRF models for this task. Three methods for representing features in these models were compared: content-based features (TF-IDF), content-independent features, and sequential features. The results demonstrated that using content-based features with TF-IDF led to the best performance, with the top-performing model, CRF, reaching a Cohen’s κ of 0.43.
Recently, Cavalcanti et al. [10] explored applying the BERT transformer model to classify instructor feedback based on established educational frameworks. The proposed method significantly improved over previous approaches, with BERT outperforming traditional machine learning models by up to 35.71%, reaching up to 0.76 in Cohen’s kappa. Also, the authors integrated XAI techniques to enhance the interpretability of the model’s predictions, indicating the most influential features in the classification process. The gains obtained by Cavalcanti et al. [10] demonstrate the potential of using BERT in feedback classification, as it outperformed traditional models. Moreover, integrating XAI ensures transparency in these predictions, giving educators more precise insights into the factors influencing feedback quality.
Finally, Hutt et al. [22] examined both classical NLP techniques and ChatGPT to devise an automated detector of peer feedback quality in K-12 settings. Classical NLP detectors, which combine simple NLP approaches with supervised machine learning, were found to be more accurate in this task than ChatGPT.

2.3 Language Models and Explainable AI

The literature highlights that BERT models have produced significant results in the education domain [10, 20]. In short, BERT is a deeply bidirectional language model designed to capture contextual representations in language by training on massive datasets, such as the Books Corpus and Wikipedia [15]. This extensive pre-training enables BERT to generate sophisticated contextual embeddings, making it highly effective for various NLP tasks.
Besides the original architecture, the BERT model also has several variants. RoBERTa enhances the original BERT model, addressing limitations in BERT’s training by employing dynamic masking, which exposes the model to varied masking patterns within the same sequence [31]. Unlike BERT, which uses static masking and masks each sequence once, RoBERTa’s dynamic approach improves training diversity. Similarly, DistilBERT, another variant of BERT, maintains the same architecture but with 50% fewer layers and 40% fewer parameters [39], aiming to offer a faster, more efficient version of BERT while retaining 97% of its performance.
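To make the size trade-off concrete, the following sketch (ours, not taken from the cited works) loads the standard public checkpoints with the Hugging Face Transformers library and counts their parameters; the checkpoint names are assumptions, since the specific pretrained weights are not discussed here.

from transformers import AutoModel

# Compare the parameter counts of the three encoder variants discussed above.
for checkpoint in ["bert-base-uncased", "roberta-base", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(checkpoint)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {n_params / 1e6:.1f}M parameters")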
Recently, GPT models have been widely recognized in academic research for their ability to address various challenges [41, 48]. For instance, studies have demonstrated the effectiveness of using GPT models, even in zero-shot scenarios, to assess essays [4] and short responses [11] automatically. Moreover, recent studies show the potential of GPT for supporting the feedback process [47].
Language models have demonstrated promising results in educational settings. Yet, a significant challenge for their practical adoption is the lack of transparency in how they generate predictions, making it difficult to understand or trust the results of these algorithms [2]. The field of Explainable Artificial Intelligence (XAI) addresses this issue by providing insights into the decision-making processes of these models, ultimately enhancing trust and usability.
An example of an XAI method is the Local Interpretable Model-Agnostic Explanations (LIME), which highlights parts of the input that contribute to a model’s predictions [45]. The primary aim of LIME is "to identify an interpretable model over the interpretable representation that is locally faithful to the classifier" [36]. Rather than attempting to create a globally interpretable model, LIME trains local models to explain individual predictions [1]. These interpretable models are designed to offer clear insights while approximating the underlying model’s behavior as accurately as possible, offering interpretability to the original predictions [1].

2.4 Research Questions

Based on the increasing use of PF in educational settings, particularly in the CSCL setting of large courses [24], there is a growing need to automate the analysis of large datasets generated by students. Earlier research has focused mainly on analyzing instructor feedback, but the classification of student PF needs further investigation. This study aims to fill this gap by evaluating the performance of various NLP models in identifying the nature of PF messages. Furthermore, we investigate the role of XAI in providing insights into the predictions made by the top-performing model. These goals are reflected in the following research questions:
RQ1: To what extent can NLP models accurately identify the nature of peer feedback messages based on text data?
RQ2: To what extent can XAI deepen our understanding of the decision-making processes of the best performing NLP model?

3 Method

This research has been approved by an internal institutional ethical committee (Nr: V-2021-0392).

3.1 Dataset

The dataset used in this study consists of PF messages collected from a large project-based university course in Europe. It involves data from 231 students who worked collaboratively on their projects in groups of four to six students. In total, the dataset contains 2,444 feedback messages written in English, which were divided into 10,319 sentences. These sentences were derived from peer assessments that were part of both formative and summative evaluations. Overall, each student provided anonymous feedback to four to five peers in their group, amounting to between 8 and 12 feedback texts throughout the course. These messages were then reviewed by the course teaching assistants and the course examiner to assess individual student contributions. The CLASS system enabled all these PF interactions (see [6]).
Once collected, the data was cleaned and annotated using an exploratory approach based on the Boyatzis framework [3]. This approach allowed for feedback categories to emerge naturally from the text data based on the annotation process that involved four main steps. First, two annotators analyzed a small subset of the data (3%) and identified six categories: management, suggestions for improvement, interpersonal factors, cognition, affect, and miscellaneous, which align with established taxonomies in CSCL and feedback research [5, 43]. Second, a sample of 500 feedback sentences was coded separately by two annotators, which achieved a moderate level of agreement (Cohen’s kappa = 0.65). Third, discrepancies were discussed and re-annotated, with a final Cohen’s kappa value of 0.86, indicating a strong agreement among raters. Lastly, the remaining dataset was coded by the annotators independently. Table 1 presents examples from this dataset along with a description of each feedback class and the number of samples per class.
Table 1: Samples from the dataset [6].
Category | Description | Example
Management (n = 1932) | All contributions to the group and the student’s tasks during the project. No emotional opinion. All task-related behaviors. | You also handled your interview nicely, asking good follow up questions to clear up any potential misunderstandings.
Affect (n = 1612) | Effect of the student on the group or on the feedback-giving person. | You have contributed actively to the design process.
Interpersonal factors (n = 207) | All interaction, both positive and negative, and the process of working in a group. | Is always willing to collaborate with the rest of the group and participate in the discussions.
Suggestions for improvement (n = 231) | Suggestions on how to improve on a certain domain. | His positive attitude towards all ideas means a lot to many of our group members, as group work can usually generate a lot of negativity.
Cognition (n = 546) | Feedback related to thinking and inspiration. | You work well in a group and often provide good points during discussions.
Miscellaneous (n = 240) | Interesting feedback that does not fit the other categories. | Don’t think much has changed since the previous peer feedback, and therefore I don’t have much to add regarding your performance.

3.2 Model Selection and Evaluation

Six models have been implemented: two traditional and four transformer-based, including GPT4o. For developing both traditional and transformer models, this study utilized an NVIDIA P100 GPU with 16 GB of RAM, where all experiments were performed on the Kaggle platform.
The traditional models used in this study were MLP and Decision Tree, both implemented using the Python library scikit-learn with default parameter settings. These models were selected for their effectiveness in previous text classification studies [9, 16]. To represent the input data, we employed TF-IDF (Term Frequency-Inverse Document Frequency) [32] as the feature extraction method, which is commonly used in text analysis to weigh the importance of words within the dataset [16].
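As an illustration, the sketch below (ours, not the study’s exact code) builds the TF-IDF plus classifier pipelines with scikit-learn defaults; the toy sentences are placeholders, and stop_words="english" marks the stop-word-removal variant mentioned in the abstract.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy placeholders; the actual dataset is described in Section 3.1.
texts = ["You also handled your interview nicely",
         "His positive attitude means a lot",
         "You work well in a group",
         "Try to share your ideas earlier"]
labels = ["Management", "Affect", "Cognition", "Suggestions for improvement"]

for name, clf in [("Decision Tree", DecisionTreeClassifier()),
                  ("MLP", MLPClassifier())]:
    # stop_words="english" removes stop words; omit the argument to keep them.
    pipeline = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                         ("clf", clf)])
    pipeline.fit(texts, labels)
    print(name, pipeline.predict(["Asking good follow up questions helped the group"]))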
The transformer-based models utilized in this study were BERT, RoBERTa, and DistilBERT, all implemented using the SimpleTransformers library, which is built upon the widely adopted Hugging Face Transformers framework. These models were configured with their default parameter settings, ensuring consistency with standard practices in transformer-based natural language processing tasks while leveraging their robust pre-trained architectures for effective text classification.
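The following is a minimal sketch of fine-tuning RoBERTa with SimpleTransformers under default settings; the DataFrame contents and the integer label encoding are illustrative assumptions, not the study’s actual data.

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Illustrative data: "labels" holds integer ids for the six feedback categories.
train_df = pd.DataFrame({
    "text": ["You also handled your interview nicely",
             "Try to share your ideas earlier"],
    "labels": [0, 3],
})

# "roberta"/"roberta-base" with default arguments; set use_cuda=True on a GPU such as the P100.
model = ClassificationModel("roberta", "roberta-base", num_labels=6, use_cuda=False)
model.train_model(train_df)
predictions, raw_outputs = model.predict(
    ["Is always willing to collaborate with the rest of the group"])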
We also integrated a GPT model to explore the potential of large-scale language models in this context. We adopted the gpt-4o-2024-08-06 model through the OpenAI API, which was the most advanced version available at the time. It is well known that prompt design plays a key role in the model’s capacity to generate accurate and relevant results [46]. In our study, the prompts were carefully tailored to the specific context of the task, incorporating clear and direct instructions and specifying the desired output format [33]. Table 2 presents the final prompt used in this study; a sketch of how such a prompt can be submitted through the API follows the table.
Table 2: Prompt for GPT-4o
Element | Text
Context | Peer feedback written by higher education students in the engineering education setting.
Instruction | Determine which category of peer feedback is predominant in the following. You should consider the following categories: Management, Suggestions for improvement, Interpersonal factors, Cognition, Affect, and Miscellaneous.
Output format | Indicate only one category in the response.
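As an illustration, the sketch below shows one way the Table 2 prompt could be sent to the gpt-4o-2024-08-06 model through the OpenAI Python client; the exact request structure used in the study is not reported, so the message layout and the classify_feedback helper are our assumptions.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_feedback(sentence: str) -> str:
    # Prompt assembled from the Context, Instruction, and Output format rows of Table 2.
    prompt = (
        "Peer feedback written by higher education students in the engineering education setting.\n"
        "Determine which category of peer feedback is predominant in the following. "
        "You should consider the following categories: Management, Suggestions for improvement, "
        "Interpersonal factors, Cognition, Affect, and Miscellaneous.\n"
        "Indicate only one category in the response.\n\n"
        f"Feedback: {sentence}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(classify_feedback("You also handled your interview nicely."))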
To answer the first research question, we split the dataset into a training set (80%) and a test set (20%). To assess the models’ performance, we employed two widely recognized metrics, accuracy and Cohen’s kappa, which are commonly used in similar studies to compare classification effectiveness in the educational domain [16].
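A minimal sketch of this evaluation setup follows; texts and labels stand in for the full sentence-level dataset, model for any of the classifiers sketched above (e.g., a scikit-learn pipeline), and random_state is an illustrative choice.

from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# 80/20 split as described above.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))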
To answer the second research question, we used the LIME library, accessible through Python, to implement LIME for interpretability. This study applied LIME to the top-performing model to provide insights into its decision-making process. A subset of the dataset was selected, allowing LIME to explain the model’s predictions by highlighting the specific words that influenced the classification. LIME assigns probabilities to these words, demonstrating their contribution to determining whether a sentence belongs to a particular class, thus offering a clearer understanding of the factors driving the model’s predictions.
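The sketch below illustrates how LIME can be applied to a fine-tuned classifier of this kind; the predict_proba wrapper around a SimpleTransformers model and the example sentence are our assumptions, not the study’s exact code.

import numpy as np
from lime.lime_text import LimeTextExplainer

class_names = ["Management", "Affect", "Interpersonal factors",
               "Suggestions for improvement", "Cognition", "Miscellaneous"]
explainer = LimeTextExplainer(class_names=class_names)

def predict_proba(sentences):
    # Softmax over the raw outputs of a fine-tuned SimpleTransformers model (assumption).
    _, raw_outputs = model.predict(list(sentences))
    exp_scores = np.exp(raw_outputs - raw_outputs.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

explanation = explainer.explain_instance(
    "Don't be afraid to be open with someone in the group.",
    predict_proba, num_features=6, top_labels=1)
label = explanation.available_labels()[0]
print(class_names[label], explanation.as_list(label=label))  # (word, weight) pairs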

4 Results

4.1 RQ1: Machine Learning Classification of PF messages

Table 3 presents the performance of the models evaluated in this study. The results reveal that RoBERTa had the highest performance across all metrics, with an accuracy of 73.1% and a Cohen’s κ of 0.61, which is considered substantial agreement with the ground truth [28]. All the other BERT variants also outperformed the traditional methods. Moreover, it is important to highlight that although GPT-4o did not achieve performance comparable to the BERT variants, it used no training data in its classification process.
Table 3: Results for the analyzed models
Model | Accuracy | κ
Decision Tree | 54.7 | 0.38
MLP | 63.7 | 0.48
BERT | 70.3 | 0.60
RoBERTa | 73.1 | 0.61
DistilBERT | 69.6 | 0.59
GPT-4o | 62.3 | 0.49

4.2 RQ2: XAI to Understand Feedback Classifications

Given RoBERTa’s highest performance, LIME was used to gain insights into the reasoning behind the model’s choices for selected contexts from the dataset. We present two examples from different feedback classes, in which words shaded in a darker color (purple and blue) indicate a higher probability that the feedback belongs to the considered class, while lighter shading indicates a lower probability.
Figure 1a shows an example of feedback from the category ’Cognition’. Words like “afraid”, “someone”, and “open” have the highest probability of explaining the model’s decision. Figure 1b presents an example of a text from the category ’Miscellaneous’. In this case, "could" and "improve" were the most predictive words.
Figure 1: Highlighted words and important features of the examples.

5 Discussion

This study aimed at improving our understanding of how the classification of student PF text data could be automated (through machine learning) to scale up related learning analytics interventions and support (e.g., adaptive scaffolding of students) in CSCL settings. We examined the performance of six machine learning models in classifying PF text messages from students in a large CSCL course (n = 231 students) in higher education. We extend previous work that used NLP to automatically classify student PF text data, with a focus on the student, to better understand the nature of their feedback to peers. We found that the transformer-based models (BERT, RoBERTa, and DistilBERT) outperformed the traditional models (MLP and Decision Tree) as well as GPT-4o. The latter finding is in line with recent results by Hutt et al. [22], who similarly showed that ChatGPT is outperformed by classical NLP techniques in the automated detection of (peer) feedback quality. In future work, different prompting strategies, which have been found to be predictive of the quality of LLM-generated outputs [25], and other LLMs should be examined in similar settings and beyond.
The results of the present study build on the findings of [22] by presenting another effective way of using NLP to classify student PF text data. The traditional models probably struggled to match the transformer-based models because they lack pre-training. RoBERTa achieved an accuracy of 73.1% and a kappa of 0.61, indicating substantial agreement [29]. In that sense, RoBERTa’s performance can be regarded as reliable, given that the study by Castro et al. [6] achieved at best an F1-score of 0.58 and a Cohen’s κ of 0.43 with CRF. Moreover, it aligns with previous work on classifying feedback messages, where BERT variants also reached the best results [10]. This finding has the potential to scale up the automated coding of student PF text data for improved learner and teaching support based on LA interventions. However, before scaling this up, we call for replication studies with datasets from similar educational contexts. Furthermore, the application of RoBERTa and other transformer-based models to datasets in languages other than English should be further explored [16].
Further, this study aimed at improving our understanding of the role of XAI, namely LIME [45], in understanding the decision-making processes of the best-performing model. LIME was applied to RoBERTa’s predictions for two contexts selected from the dataset. The most predictive words highlighted by XAI are aligned with the theory behind the classes we were evaluating [44].
A practical implication of these findings is that RoBERTa has the potential to accurately identify the nature of PF text data, providing useful insights for educators and researchers in the field. This model can therefore aid examiners, teachers, and instructors in analyzing and identifying the nature of PF and, potentially, its relation to performance, to support future learning processes within that learning environment. A specific implication for using NLP models to analyze PF text data is the need to consider whether stop words are present in the text data. By using XAI to analyze the selected contexts, we found that specific stop words are a crucial factor in RoBERTa’s decision making.
For instance, one might build on our findings to design a human-in-the-loop approach that could enhance the interpretability and reliability of the model’s outputs. Involving educators or domain specialists in the data analysis process would allow for iterative refinement of the model’s outputs, ensuring that its insights align with stakeholders’ needs and educational contexts. Additionally, this approach could help fine-tune the model by integrating automated analysis and human expertise. As a result, implementing such a human-in-the-loop strategy could lead to a deeper understanding of the relationship between language patterns in PF text messages and educational outcomes, ultimately contributing to more personalized and effective learning interventions and improved learning outcomes.
In that context, LIME is well suited to foster transparency and trust. Because LIME makes RoBERTa’s individual predictions understandable by highlighting the components of the text that most influence the model’s outcomes, it helps build trust with users who rely on these insights to make informed decisions. By highlighting how specific features, such as stop words, impact the analysis, LIME supports educators in validating and refining the model’s predictions. Consequently, this would enhance the credibility of automated feedback systems and encourage their integration into educational practices.
This paper has several limitations. First, the dataset used in this study was imbalanced, which means the models might favor the classes with the most instances. There are several methods to deal with data imbalance, such as undersampling the classes with the most data and oversampling the classes with the least data so that all classes have the same number of labels. Additionally, the default hyperparameters were used for all models; the models’ performance could be improved if the hyperparameters were tuned using a variety of methods [37]. Furthermore, the GPT results could also be improved by incorporating additional elements, such as examples (few-shot prompts). Lastly, the data was split so that 80% was used for training and 20% for testing. The proportions of training and testing data can be changed; however, the models’ performance may then vary. Methods such as cross-validation, in which different subsets of the data are used for training and testing, can therefore be applied. Similarly, the transformer-based models could be further investigated in terms of different architectures and training procedures. Finally, finding a model that has already been pre-trained on peer assessment feedback or similar data could be valuable in understanding how applicable it is to the dataset used in this study and similar ones.
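As an illustration of two of these remedies, the sketch below (ours, not part of the study) combines class weighting with 5-fold cross-validation scored by Cohen’s kappa; texts and labels again stand in for the full dataset, and the fold count and class_weight="balanced" setting are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", DecisionTreeClassifier(class_weight="balanced")),  # weight classes to counter imbalance
])
# Evaluate on 5 different train/test partitions instead of a single 80/20 split.
scores = cross_val_score(pipeline, texts, labels, cv=5,
                         scoring=make_scorer(cohen_kappa_score))
print("Mean Cohen's kappa across folds:", scores.mean())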

6 Conclusion

The strong performance of transformer-based models, particularly RoBERTa, suggests a shift in how natural language processing tasks such as peer feedback classification can be approached in CSCL educational settings. By outperforming traditional models and prior approaches, this study demonstrates the potential of transformers to enhance learning analytics and AI-informed applications in education, particularly when analyzing nuanced, unstructured text data such as that used in this study. This could pave the way for more accurate automated feedback systems, better peer interaction analysis, and improved instructional design in collaborative learning environments. Overall, the use of transformer models, especially RoBERTa, along with explainable AI, to classify peer feedback in CSCL settings not only surpasses traditional models but also sets a new benchmark in terms of accuracy and reliability, marking a notable improvement in educational natural language processing applications.

Acknowledgments

This study was partially supported by Digital Futures (Stockholm, Sweden) and by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (310888/2021-2). Finally, we would like to thank the OpenAI research program for providing the credits required for this experiment, and all the study participants.

References

[1]
Jaweriah Alvi. 2021. Explainable Multimodal Fusion.
[2]
Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information fusion 58 (2020), 82–115.
[3]
Richard E Boyatzis. 1998. Transforming qualitative information: Thematic analysis and code development. Sage.
[4]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[5]
Mayara Simões de Oliveira Castro, Rafael Ferreira Mello, Giuseppe Fiorentino, Olga Viberg, Daniel Spikol, Martine Baars, and Dragan Gašević. 2023. Understanding peer feedback contributions using natural language processing. (2023), 399–414.
[6]
Mayara Simões de Oliveira Castro, Rafael Ferreira Mello, Giuseppe Fiorentino, Olga Viberg, Daniel Spikol, Martine Baars, and Dragan Gašević. 2023. Understanding peer feedback contributions using natural language processing. (2023), 399–414.
[7]
Anderson Pinheiro Cavalcanti, Rafael Ferreira Leite de Mello, Vitor Rolim, Máverick André, Fred Freitas, and Dragan Gaševic. 2019. An Analysis of the use of Good Feedback Practices in Online Learning Courses. In 2019 IEEE 19th International Conference on Advanced Learning Technologies (ICALT), Vol. 2161. IEEE, 153–157.
[8]
Anderson Pinheiro Cavalcanti, Arthur Diego, Rafael Ferreira Mello, Katerina Mangaroska, André Nascimento, Fred Freitas, and Dragan Gašević. 2020. How good is my feedback? a content analysis of written feedback. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge. 428–437.
[9]
Anderson Pinheiro Cavalcanti, Arthur Diego, Rafael Ferreira Mello, Katerina Mangaroska, André Nascimento, Fred Freitas, and Dragan Gaševic. 2020. How good is my feedback? A Content Analysis of Written Feedback. In Proceedings of the 10th International Conference on Learning Analytics and Knowledge - LAK. ACM.
[10]
Anderson Pinheiro Cavalcanti, Rafael Ferreira Mello, Dragan Gašević, and Fred Freitas. 2023. Towards explainable prediction feedback messages using BERT. International Journal of Artificial Intelligence in Education (2023), 1–26.
[11]
Imran Chamieh, Torsten Zesch, and Klaus Giebermann. 2024. LLMs in Short Answer Scoring: Limitations and Promise of Zero-Shot and Few-Shot Approaches. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). 309–315.
[12]
Juanjuan Chen, Minhong Wang, Paul A Kirschner, and Chin-Chung Tsai. 2018. The role of collaboration, computer use, learning environments, and supporting strategies in CSCL: A meta-analysis. Review of Educational Research 88, 6 (2018), 799–843.
[13]
Juanjuan Chen, Minhong Wang, Paul A Kirschner, and Chin-Chung Tsai. 2018. The role of collaboration, computer use, learning environments, and supporting strategies in CSCL: A meta-analysis. Review of Educational Research 88, 6 (2018), 799–843.
[14]
Shu-Yun Chien, Gwo-Jen Hwang, and Morris Siu-Yung Jong. 2020. Effects of peer assessment within the context of spherical video-based virtual reality on EFL students’ English-Speaking performance and learning perceptions. Computers & Education 146 (2020), 103751.
[15]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[16]
Rafael Ferreira-Mello, Máverick André, Anderson Pinheiro, Evandro Costa, and Cristobal Romero. 2019. Text mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9, 6 (2019), e1332.
[17]
Soon Yen Foo. 2021. Analysing peer feedback in asynchronous online discussions: A case study. Education and Information Technologies 26, 4 (2021), 4553–4572.
[18]
Mario Gielen and Bram De Wever. 2015. Structuring the peer assessment process: A multilevel approach for the impact on product improvement and peer feedback quality. Journal of Computer Assisted Learning 31, 5 (2015), 435–449.
[19]
John Hattie and Helen Timperley. 2007. The power of feedback. Review of educational research 77, 1 (2007), 81–112.
[20]
Yuanyuan Hu, Rafael Ferreira Mello, and Dragan Gašević. 2021. Automatic analysis of cognitive presence in online discussions: An approach using deep learning and explainable artificial intelligence. Computers and Education: Artificial Intelligence 2 (2021), 100037.
[21]
Changqin Huang, Yaxin Tu, Zhongmei Han, Fan Jiang, Fei Wu, and Yunliang Jiang. 2023. Examining the relationship between peer feedback classified by deep learning and online learning burnout. Computers & Education 207 (2023), 104910.
[22]
Stephen Hutt, Allison DePiro, Joann Wang, Sam Rhodes, Ryan S Baker, Grayson Hieb, Sheela Sethuraman, Jaclyn Ocumpaugh, and Caitlin Mills. 2024. Feedback on Feedback: Comparing Classic Natural Language Processing and Generative AI to Evaluate Peer Feedback. In Proceedings of the 14th Learning Analytics and Knowledge Conference. 55–65.
[23]
Georgeta Ion, Aleix Barrera-Corominas, and Marina Tomàs-Folch. 2016. Written peer-feedback to enhance students’ current and future learning. International Journal of Educational Technology in Higher Education 13 (2016), 1–11.
[24]
Nafiseh Taghizadeh Kerman, Seyyed Kazem Banihashem, Mortaza Karami, Erkan Er, Stan Van Ginkel, and Omid Noroozi. 2024. Online peer feedback in higher education: A synthesis of the literature. Education and Information Technologies 29, 1 (2024), 763–813.
[25]
Nils Knoth, Antonia Tolzin, Andreas Janson, and Jan Marco Leimeister. 2024. AI literacy and its implications for prompt engineering strategies. Computers and Education: Artificial Intelligence 6 (2024), 100225.
[26]
Ingo Kollar and Frank Fischer. 2010. Peer assessment as collaborative learning: a cognitive perspective. Learning and Instruction 20(4) (2010), 344–348. https://hal.science/hal-00703943
[27]
Ingo Kollar and Frank Fischer. 2010. Peer assessment as collaborative learning: A cognitive perspective. Learning and instruction 20, 4 (2010), 344–348.
[28]
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174.
[29]
J Richard Landis and Gary G Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33, 1 (1977), 159–174.
[30]
Hongli Li, Yao Xiong, Charles Vincent Hunter, Xiuyan Guo, and Rurik Tywoniw. 2020. Does peer assessment promote student learning? A meta-analysis. Assessment & Evaluation in Higher Education 45, 2 (2020), 193–211.
[31]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[32]
Christopher Manning and Hinrich Schutze. 1999. Foundations of statistical natural language processing. MIT press.
[33]
Rafael Ferreira Mello, Luiz Rodrigues, Erverson Sousa, Hyan Batista, Mateus Lins, Andre Nascimento, and Dragan Gasevic. 2024. Automatic Detection of Narrative Rhetorical Categories and Elements on Middle School Written Essays. In International Conference on Artificial Intelligence in Education. Springer, 295–308.
[34]
Ikenna Osakwe, Guanliang Chen, Alex Whitelock-Wainwright, Dragan Gašević, Anderson Pinheiro Cavalcanti, and Rafael Ferreira Mello. 2022. Towards automated content analysis of educational feedback: A multi-language study. Computers and Education: Artificial Intelligence 3 (2022), 100059.
[35]
Chris Phielix, Frans J Prins, and Paul A Kirschner. 2010. Awareness of group performance in a CSCL-environment: Effects of peer feedback and reflection. Computers in Human Behavior 26, 2 (2010), 151–161.
[36]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386 (2016).
[37]
Christine Rosquist. 2021. Text Classification of Human Resources-related Data with Machine Learning.
[38]
Dorian Ruiz Alonso, Claudia Zepeda Cortés, Hilda Castillo Zacatelco, and José Luis Carballido Carranza. 2022. Hyperparameter tuning for multi-label classification of feedbacks in online courses. Journal of Intelligent & Fuzzy Systems, Preprint (2022), 1–9.
[39]
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
[40]
Bianca A Simonsmeier, Henrike Peiffer, Maja Flaig, and Michael Schneider. 2020. Peer feedback improves students’ academic self-concept in higher education. Research in Higher Education 61 (2020), 706–724.
[41]
Yan Tao, Olga Viberg, Ryan S Baker, and René F Kizilcec. 2024. Cultural bias and cultural alignment of large language models. PNAS nexus 3, 9 (2024), pgae346.
[42]
Ineke Van den Berg, Wilfried Admiraal, and Albert Pilot. 2006. Designing student peer assessment in higher education: Analysis of written and oral peer feedback. Teaching in higher education 11, 2 (2006), 135–147.
[43]
Olga Viberg, Martine Baars, Rafael Ferreira Mello, Niels Weerheim, Daniel Spikol, Cristian Bogdan, Dragan Gasevic, and Fred Paas. 2024. Exploring the nature of peer feedback: An epistemic network analysis approach. Journal of Computer Assisted Learning (2024).
[44]
Olga Viberg, Anna Mavroudi, Ylva Fernaeus, Cristian Bogdan, and Jarmo Laaksolahti. 2020. Reducing free riding: CLASS–a system for collaborative learning assessment. In Methodologies and Intelligent Systems for Technology Enhanced Learning, 9th International Conference, Workshops. Springer, 132–138.
[45]
Klaus Virtanen. 2022. Using XAI Tools to Detect Harmful Bias in ML Models.
[46]
Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023).
[47]
Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martinez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao Jin, and Dragan Gašević. [n.d.]. Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology n/a, n/a ([n. d.]).
[48]
Zhuang Ziyu, Chen Qiguang, Ma Longxuan, Li Mingda, Han Yi, Qian Yushan, Bai Haopeng, Zhang Weinan, and Ting Liu. 2023. Through the Lens of Core Competency: Survey on Evaluation of Large Language Models. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum), Jiajun Zhang (Ed.). Chinese Information Processing Society of China, Harbin, China, 88–109. https://aclanthology.org/2023.ccl-2.8

Published In: LAK '25: Proceedings of the 15th International Learning Analytics and Knowledge Conference, March 2025, 1018 pages. ISBN: 9798400707018. DOI: 10.1145/3706468.
Publisher: Association for Computing Machinery, New York, NY, United States.
Published: 03 March 2025.
Author Tags: Peer feedback; Higher Education; Machine Learning; Explainable artificial intelligence; Computer Supported Collaborative Learning.