DOI: 10.1145/3636243.3636256
research-article
Open access

A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education

Published: 29 January 2024

Abstract

Educators constantly need to develop and maintain effective, up-to-date assessments. While a growing body of computing education research examines the use of large language models (LLMs) for generating and engaging with coding exercises, their use for generating programming multiple-choice questions (MCQs) has not been extensively explored. We analyzed the capability of GPT-4 to produce MCQs aligned with specific learning objectives (LOs) from Python programming classes in higher education. Specifically, we developed an LLM-powered (GPT-4) system that generates MCQs from high-level course context and module-level LOs. We evaluated 651 LLM-generated and 449 human-crafted MCQs aligned to 246 LOs from 6 Python courses. We found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors. We also observed that the generated MCQs appeared to be well-aligned with the LOs. Educators wishing to take advantage of state-of-the-art generative models to support MCQ authoring can build on these findings.

Published In

ACE '24: Proceedings of the 26th Australasian Computing Education Conference
January 2024
208 pages
ISBN: 9798400716195
DOI: 10.1145/3636243
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Assessments
  2. Automated Content Generation
  3. Automatic Generation
  4. GPT-4
  5. LLMs
  6. LOs
  7. Large Language Models
  8. Learning Objectives
  9. MCQs
  10. Multiple-choice Questions

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ACE 2024
ACE 2024: Australasian Computing Education Conference
January 29 - February 2, 2024
Sydney, NSW, Australia

Acceptance Rates

Overall Acceptance Rate 161 of 359 submissions, 45%
