Abstract
Automated Essay Scoring (AES) plays a crucial role in providing immediate feedback, reducing educators' grading workload, and improving students' learning experiences. With their strong generalization capabilities, large language models (LLMs) offer a new perspective on AES. While previous research has primarily employed deep learning architectures and models such as BERT for feature extraction and scoring, the potential of LLMs in Chinese AES remains largely unexplored. In this paper, we explore the capabilities of LLMs in Chinese AES, investigating the effectiveness of well-established models, e.g., the GPT series by OpenAI and Qwen-1.8B by Alibaba Cloud. We constructed a Chinese essay dataset with carefully developed rubrics, based on which we collected grades from human raters. We then prompted the LLMs, specifically GPT-4, a fine-tuned GPT-3.5, and a fine-tuned Qwen, to produce grades, adopting different strategies for prompt generation and model fine-tuning. Comparisons between the grades assigned by the LLMs and those of the human raters suggest that the prompt-generation strategy has a marked impact on their agreement. With model fine-tuning, the consistency between LLM scores and human scores improved further. Comparative experiments show that the fine-tuned GPT-3.5 and Qwen outperform BERT in quadratic weighted kappa (QWK). These results highlight the substantial potential of LLMs in Chinese AES and pave the way for further research on integrating LLMs into Chinese AES with varied strategies for prompt generation and model fine-tuning.
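As a rough illustration of the pipeline the abstract describes, the sketch below shows how an essay might be submitted to GPT-4 with a rubric-based prompt. It assumes the OpenAI Python SDK (v1+); the system instruction, the 1-to-5 grade scale, and the function name score_essay are illustrative placeholders, not the paper's actual prompt strategy.

```python
# Minimal sketch of rubric-based LLM grading, assuming the OpenAI
# Python SDK (openai>=1.0). The prompt wording and grade scale are
# placeholders, not the prompts used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_essay(essay: str, rubric: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a Chinese essay rater. Apply the rubric "
                        "and reply with a single integer grade from 1 to 5."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nEssay:\n{essay}\n\nGrade:"},
        ],
        temperature=0,  # deterministic output for reproducible grading
    )
    return response.choices[0].message.content.strip()
```

The agreement metric named in the abstract, quadratic weighted kappa (QWK), can be computed directly from two grade vectors. The following is a minimal self-contained implementation; the grade vectors in the usage example are made-up numbers, not results from the paper.

```python
# Minimal sketch: quadratic weighted kappa (QWK), the agreement metric
# used to compare LLM grades with human grades.
import numpy as np

def quadratic_weighted_kappa(human, model, num_grades):
    """QWK between two integer grade vectors on a 1..num_grades scale."""
    human = np.asarray(human) - 1  # shift to 0-based grade indices
    model = np.asarray(model) - 1
    k = num_grades

    # Observed matrix: counts of (human grade, model grade) pairs.
    observed = np.zeros((k, k))
    for h, m in zip(human, model):
        observed[h, m] += 1

    # Expected matrix under chance agreement (outer product of the
    # marginals), normalized to the same total count as observed.
    expected = np.outer(np.bincount(human, minlength=k),
                        np.bincount(model, minlength=k))
    expected = expected / expected.sum() * observed.sum()

    # Quadratic disagreement weights: zero on the diagonal, growing
    # with the squared distance between the two grades.
    i, j = np.indices((k, k))
    weights = (i - j) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Toy usage on a 5-point scale (hypothetical grades).
human_grades = [3, 4, 2, 5, 3, 1, 4]
llm_grades   = [3, 4, 3, 5, 2, 1, 4]
print(round(quadratic_weighted_kappa(human_grades, llm_grades, 5), 3))
```

QWK ranges from 1 (perfect agreement) down through 0 (chance-level agreement), and the quadratic weights penalize large grade disagreements more heavily than near-misses, which is why it is the standard agreement measure in AES evaluations.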
Acknowledgements
This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 1 (A Trustworthy Feedback Agent for Secondary School Chinese Language Learning), the Shenzhen Science and Technology Foundation (General Program, JCYJ20210324093212034), and the 2022 Guangdong Province Undergraduate University Quality Engineering Project (Shenzhen University Academic Affairs [2022] No. 7).
Appendix
The detailed scoring rubrics and the essay example from K3 students can be accessed at https://github.com/seamoon224/AIED-2024.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Feng, H., et al. (2024). Leveraging Large Language Models for Automated Chinese Essay Scoring. In: Olney, A.M., Chounta, I.-A., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds.) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science, vol. 14829. Springer, Cham. https://doi.org/10.1007/978-3-031-64302-6_32
DOI: https://doi.org/10.1007/978-3-031-64302-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64301-9
Online ISBN: 978-3-031-64302-6
eBook Packages: Computer Science, Computer Science (R0)