Abstract
Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid, especially in subjects like math, which requires models to understand the problem, the solution, and the location of the student's error. Feedback also has to be pedagogically valid, reflecting effective tutoring strategies such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address both problems, automatically generating and evaluating feedback, while considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to use it effectively to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of feedback generated with Llama 2, an open-source LLM; we qualitatively analyze our generation and evaluation systems using case studies and outline several areas for future work. (Our code is available at https://github.com/umass-ml4ed/feedback-gen-dpo.)
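The DPO objective mentioned in the abstract trains the policy to prefer the GPT-4-preferred feedback in each pair, relative to a frozen reference model. Below is a minimal sketch of the standard DPO loss for a single preference pair; the function and variable names are illustrative, not taken from the paper's released code, and a real implementation would compute the log-probabilities with the policy and reference LLMs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are total log-probabilities of the preferred/rejected feedback
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference model. beta controls how strongly the
    policy is pushed away from the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as log1p(exp(-margin)) for stability
    return math.log1p(math.exp(-margin))

# If the policy favors the preferred feedback more than the reference
# does, the margin is positive and the loss is small; reversing the
# preference makes the loss large.
loss_good = dpo_loss(-10.0, -20.0, -15.0, -15.0)  # margin = +1.0
loss_bad = dpo_loss(-20.0, -10.0, -15.0, -15.0)   # margin = -1.0
```

In practice this loss is averaged over the augmented dataset of feedback pairs, with the reference model's log-probabilities precomputed or held fixed during training.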
The authors thank Schmidt Futures and the NSF (under grants IIS-2118706 and IIS-2237676) for partially supporting this work.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Scarlatos, A., Smith, D., Woodhead, S., Lan, A. (2024). Improving the Validity of Automatically Generated Feedback via Reinforcement Learning. In: Olney, A.M., Chounta, IA., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science(), vol 14829. Springer, Cham. https://doi.org/10.1007/978-3-031-64302-6_20
Print ISBN: 978-3-031-64301-9
Online ISBN: 978-3-031-64302-6