Abstract
Automated Essay Scoring (AES) plays a crucial role in providing immediate feedback, reducing educators' grading workload, and improving students' learning experiences. With their strong generalization capabilities, large language models (LLMs) offer a new perspective on AES. While previous research has primarily employed deep learning architectures and models such as BERT for feature extraction and scoring, the potential of LLMs in Chinese AES remains largely unexplored. In this paper, we explore the capabilities of LLMs in Chinese AES, investigating the effectiveness of well-established models, e.g., the GPT series by OpenAI and Qwen-1.8B by Alibaba Cloud. We constructed a Chinese essay dataset with carefully developed rubrics, based on which we collected grades from human raters. We then prompted the LLMs, specifically GPT-4, a fine-tuned GPT-3.5, and a fine-tuned Qwen, to produce grades, adopting different strategies for prompt generation and model fine-tuning. Comparisons between the grades assigned by the LLMs and those of the human raters suggest that the prompt-generation strategy has a marked impact on their agreement. With model fine-tuning, the consistency between LLM scores and human scores improved further. Comparative experiments show that the fine-tuned GPT-3.5 and Qwen outperform BERT in quadratic weighted kappa (QWK). These results highlight the substantial potential of LLMs in Chinese AES and pave the way for further research on integrating LLMs into Chinese AES with varied strategies for prompt generation and model fine-tuning.
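As a rough illustration of the pipeline the abstract describes, the sketch below shows how an essay might be submitted to GPT-4 with a rubric-based prompt. It assumes the OpenAI Python SDK (v1+); the system instruction, the 1-to-5 grade scale, and the function name score_essay are illustrative placeholders, not the paper's actual prompt strategy.

```python
# Minimal sketch of rubric-based LLM grading, assuming the OpenAI
# Python SDK (openai>=1.0). The prompt wording and grade scale are
# placeholders, not the prompts used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_essay(essay: str, rubric: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a Chinese essay rater. Apply the rubric "
                        "and reply with a single integer grade from 1 to 5."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nEssay:\n{essay}\n\nGrade:"},
        ],
        temperature=0,  # deterministic output for reproducible grading
    )
    return response.choices[0].message.content.strip()
```

The agreement metric named in the abstract, quadratic weighted kappa (QWK), can be computed directly from two grade vectors. The following is a minimal self-contained implementation; the grade vectors in the usage example are made-up numbers, not results from the paper.

```python
# Minimal sketch: quadratic weighted kappa (QWK), the agreement metric
# used to compare LLM grades with human grades.
import numpy as np

def quadratic_weighted_kappa(human, model, num_grades):
    """QWK between two integer grade vectors on a 1..num_grades scale."""
    human = np.asarray(human) - 1  # shift to 0-based grade indices
    model = np.asarray(model) - 1
    k = num_grades

    # Observed matrix: counts of (human grade, model grade) pairs.
    observed = np.zeros((k, k))
    for h, m in zip(human, model):
        observed[h, m] += 1

    # Expected matrix under chance agreement (outer product of the
    # marginals), normalized to the same total count as observed.
    expected = np.outer(np.bincount(human, minlength=k),
                        np.bincount(model, minlength=k))
    expected = expected / expected.sum() * observed.sum()

    # Quadratic disagreement weights: zero on the diagonal, growing
    # with the squared distance between the two grades.
    i, j = np.indices((k, k))
    weights = (i - j) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Toy usage on a 5-point scale (hypothetical grades).
human_grades = [3, 4, 2, 5, 3, 1, 4]
llm_grades   = [3, 4, 3, 5, 2, 1, 4]
print(round(quadratic_weighted_kappa(human_grades, llm_grades, 5), 3))
```

QWK ranges from 1 (perfect agreement) down through 0 (chance-level agreement), and the quadratic weights penalize large grade disagreements more heavily than near-misses, which is why it is the standard agreement measure in AES evaluations.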
Acknowledgements
This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 1 (A Trustworthy Feedback Agent for Secondary School Chinese Language Learning), the Shenzhen Science and Technology Foundation (General Program, JCYJ20210324093212034), and the 2022 Guangdong Province Undergraduate University Quality Engineering Project (Shenzhen University Academic Affairs [2022] No. 7).
Appendix
The detailed scoring rubrics and the essay example from K3 students can be accessed at https://github.com/seamoon224/AIED-2024.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Feng, H., et al. (2024). Leveraging Large Language Models for Automated Chinese Essay Scoring. In: Olney, A.M., Chounta, I.-A., Liu, Z., Santos, O.C., Bittencourt, I.I. (eds.) Artificial Intelligence in Education. AIED 2024. Lecture Notes in Computer Science, vol. 14829. Springer, Cham. https://doi.org/10.1007/978-3-031-64302-6_32
DOI: https://doi.org/10.1007/978-3-031-64302-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64301-9
Online ISBN: 978-3-031-64302-6
eBook Packages: Computer Science, Computer Science (R0)