Abstract
Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid, especially in subjects like math, which requires models to understand the problem, the solution, and the location of the student's error. Feedback also has to be pedagogically valid, reflecting effective tutoring strategies such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address both problems, automatically generating and evaluating feedback, while considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to use it effectively to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of feedback generated with Llama 2, an open-source LLM; we qualitatively analyze our generation and evaluation systems using case studies and outline several areas for future work. (Our code is available at https://github.com/umass-ml4ed/feedback-gen-dpo.)
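The DPO objective mentioned in the abstract trains the policy to prefer the GPT-4-preferred feedback in each pair, relative to a frozen reference model. Below is a minimal sketch of the standard DPO loss for a single preference pair; the function and variable names are illustrative, not taken from the paper's released code, and a real implementation would compute the log-probabilities with the policy and reference LLMs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are total log-probabilities of the preferred/rejected feedback
    under the policy being trained; ref_logp_* are the same quantities
    under the frozen reference model. beta controls how strongly the
    policy is pushed away from the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as log1p(exp(-margin)) for stability
    return math.log1p(math.exp(-margin))

# If the policy favors the preferred feedback more than the reference
# does, the margin is positive and the loss is small; reversing the
# preference makes the loss large.
loss_good = dpo_loss(-10.0, -20.0, -15.0, -15.0)  # margin = +1.0
loss_bad = dpo_loss(-20.0, -10.0, -15.0, -15.0)   # margin = -1.0
```

In practice this loss is averaged over the augmented dataset of feedback pairs, with the reference model's log-probabilities precomputed or held fixed during training.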
The authors thank Schmidt Futures and the NSF (under grants IIS-2118706 and IIS-2237676) for partially supporting this work.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Scarlatos, A., Smith, D., Woodhead, S., Lan, A. (2024). Improving the Validity of Automatically Generated Feedback via Reinforcement Learning. In: Olney, A.M., Chounta, IA., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science(), vol 14829. Springer, Cham. https://doi.org/10.1007/978-3-031-64302-6_20
Print ISBN: 978-3-031-64301-9
Online ISBN: 978-3-031-64302-6