Exploring quality criteria and evaluation methods in automated question generation: A comprehensive survey

Published in: Education and Information Technologies

Abstract

In light of the widespread adoption of technology-enhanced learning and assessment platforms, there is a growing demand for innovative, high-quality, and diverse assessment questions. Automatic Question Generation (AQG) has emerged as a valuable solution, enabling educators and assessment developers to efficiently produce a large volume of test items, questions, or assessments within a short timeframe. AQG leverages computer algorithms to automatically generate questions, streamlining the question-generation process. Despite the efficiency gains, significant gaps in the question-generation pipeline hinder the seamless integration of AQG systems into the assessment process. Notably, the absence of a standardized evaluation framework poses a substantial challenge in assessing the quality and usability of automatically generated questions. This study addresses this gap by conducting a comprehensive survey of existing question evaluation methods, a crucial step in refining the question generation pipeline. Subsequently, we present a taxonomy for these evaluation methods, shedding light on their respective advantages and limitations within the AQG context. The study concludes by offering recommendations for future research to enhance the effectiveness of AQG systems in educational assessments.
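The paper does not prescribe a particular metric, but one family of evaluation methods commonly applied to automatically generated questions scores a machine-generated question against a human-written reference using n-gram overlap, in the spirit of BLEU- and ROUGE-style similarity metrics. The sketch below is purely illustrative and not taken from the paper; the function names, tokenization, and example questions are assumptions made for demonstration.

```python
# Minimal sketch of an n-gram-overlap score for a generated question against a
# reference question. This is an illustration of one class of automatic
# evaluation methods, not the authors' procedure.

from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams also found in the reference (counts clipped)."""
    cand = ngrams(candidate.lower().split(), n)
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    if not cand:
        return 0.0
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in Counter(cand).items())
    return matched / len(cand)

# Hypothetical generated question vs. a human-written reference question.
generated = "What process do plants use to convert sunlight into energy?"
reference = "Which process allows plants to convert sunlight into chemical energy?"
print(overlap_precision(generated, reference, n=1))  # unigram overlap precision
print(overlap_precision(generated, reference, n=2))  # bigram overlap precision
```

Surface-overlap scores of this kind are cheap to compute but capture little about pedagogical quality or difficulty, which is why they are typically complemented by human review and item-level psychometric analysis.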


Data availability

The manuscript has no associated data.


Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Contributions

GG: Conceptualization, methodology, formal analysis, writing—original draft preparation. OB: Conceptualization, supervision, writing—review and editing.

Corresponding author

Correspondence to Guher Gorgun.

Ethics declarations

Consent for publication

All authors read and approved the final manuscript.

Competing interests

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gorgun, G., Bulut, O. Exploring quality criteria and evaluation methods in automated question generation: A comprehensive survey. Educ Inf Technol 29, 24111–24142 (2024). https://doi.org/10.1007/s10639-024-12771-3

