CSEPrompts: A Benchmark of Introductory Computer Science Prompts

Published: 17 June 2024

Abstract

Recent advances in AI, machine learning, and NLP have led to a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, making it possible to use LLMs to produce high-quality text for academic and professional purposes. Schools and universities are aware of students' increasing use of AI-generated content and have been studying the impact of this new technology and its potential for misuse. Educational programs in Computer Science (CS) and related fields are particularly affected because LLMs can also generate programming code in a variety of programming languages. To help understand the potential impact of publicly available LLMs on CS education, we introduce CSEPrompts (https://github.com/mraihan-gmu/CSEPrompts), a framework with hundreds of programming exercise prompts and multiple-choice questions retrieved from introductory CS and programming courses. We also provide experimental results on CSEPrompts that evaluate the performance of several LLMs at generating Python code and answering basic computer science and programming questions.
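To give a concrete sense of what evaluating an LLM on a CSEPrompts-style coding exercise involves, the snippet below is a minimal sketch only: the actual data format and evaluation harness are defined in the linked repository, and the names passes_tests, candidate, and tests are hypothetical. It runs an imagined model completion against an exercise's unit tests in a fresh subprocess and reports pass or fail.

    # Minimal sketch (hypothetical names; see the CSEPrompts repository
    # for the real data format and harness): execute an LLM-generated
    # solution plus its unit tests in a separate Python process.
    import subprocess
    import sys
    import tempfile
    import textwrap
    from pathlib import Path

    def passes_tests(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
        """Return True only if every assertion passes within the time limit."""
        with tempfile.TemporaryDirectory() as tmp:
            script = Path(tmp) / "check.py"
            script.write_text(candidate_code + "\n\n" + test_code)
            try:
                result = subprocess.run([sys.executable, str(script)],
                                        capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False  # a hanging solution counts as a failure
            return result.returncode == 0

    # An imagined introductory exercise and an imagined model completion;
    # real benchmark items come from the dataset itself.
    candidate = "def add(a, b):\n    return a + b"
    tests = textwrap.dedent("""
        assert add(2, 3) == 5
        assert add(-1, 1) == 0
    """)
    print("pass" if passes_tests(candidate, tests) else "fail")

Aggregating pass/fail outcomes of this kind over all coding prompts yields per-model scores of the sort the experiments report; multiple-choice questions can be scored by comparing the selected option against the answer key.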



Published In

Foundations of Intelligent Systems: 27th International Symposium, ISMIS 2024, Poitiers, France, June 17–19, 2024, Proceedings
Jun 2024
318 pages
ISBN: 978-3-031-62699-9
DOI: 10.1007/978-3-031-62700-2
Editors:
  • Annalisa Appice
  • Hanane Azzag
  • Mohand-Said Hacid
  • Allel Hadjali
  • Zbigniew Ras

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Benchmark Dataset
  2. Code LLM
  3. Prompting

Qualifiers

  • Article
