
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning

  • Conference paper
  • In: Artificial Intelligence in Education (AIED 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14829)

Abstract

Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid, especially in subjects like math, which requires models to understand the problem, the solution, and where the student’s error lies. Feedback also has to be pedagogically valid to reflect effective tutoring strategies, such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address both the generation and the evaluation of feedback, considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to effectively use it to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4’s annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of feedback generated with Llama 2, an open-source LLM, qualitatively analyze our generation and evaluation systems using case studies, and outline several areas for future work. (Our code is available at https://github.com/umass-ml4ed/feedback-gen-dpo).

The authors thank Schmidt Futures and the NSF (under grants IIS-2118706 and IIS-2237676) for partially supporting this work.
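To make the pipeline described in the abstract concrete, the sketch below illustrates its two core steps: converting GPT-4's rubric annotations into preference pairs, and training with the DPO objective of Rafailov et al. This is a minimal sketch, not the authors' released code (see the linked repository for the actual implementation); the data layout, the function names `build_preference_pairs` and `dpo_loss`, and the `beta` default are illustrative assumptions.

```python
# Minimal sketch of rubric-scored preference pairs + the DPO loss.
# Assumes feedback messages have already been scored by GPT-4 against
# a rubric; all names and the data layout here are hypothetical.
import torch
import torch.nn.functional as F


def build_preference_pairs(scored_feedback):
    """Turn rubric scores into (prompt, chosen, rejected) training pairs.

    `scored_feedback` maps each prompt (problem + incorrect student answer)
    to a list of (feedback_text, rubric_score) tuples, where a higher
    score means more valid feedback.
    """
    pairs = []
    for prompt, candidates in scored_feedback.items():
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        for i, (chosen, hi) in enumerate(ranked):
            for rejected, lo in ranked[i + 1:]:
                if hi > lo:  # only strict preferences become training pairs
                    pairs.append((prompt, chosen, rejected))
    return pairs


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs (Rafailov et al., 2023).

    Each argument is a 1-D tensor of summed token log-probabilities of the
    chosen/rejected feedback under the trained policy or the frozen
    reference model; `beta` controls the strength of the implicit KL penalty.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between preferred and rejected feedback.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

Restricting pairs to strictly different rubric scores reflects a design constraint of DPO: each pair needs an unambiguous winner, since tied candidates carry no preference signal.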



Author information

Correspondence to Alexander Scarlatos.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Scarlatos, A., Smith, D., Woodhead, S., Lan, A. (2024). Improving the Validity of Automatically Generated Feedback via Reinforcement Learning. In: Olney, A.M., Chounta, I.-A., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science, vol. 14829. Springer, Cham. https://doi.org/10.1007/978-3-031-64302-6_20


  • DOI: https://doi.org/10.1007/978-3-031-64302-6_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-64301-9

  • Online ISBN: 978-3-031-64302-6

  • eBook Packages: Computer Science, Computer Science (R0)
