

Applying Large Language Models to Enhance the Assessment of Parallel Functional Programming Assignments

Skyler Grandel (skyler.h.grandel@vanderbilt.edu), Douglas C. Schmidt (d.schmidt@vanderbilt.edu), and Kevin Leach (kevin.leach@vanderbilt.edu)
Vanderbilt University, Nashville, Tennessee, USA
ABSTRACT

Courses in computer science (CS) often assess student programming assignments manually, with the intent of providing in-depth feedback to each student regarding correctness, style, efficiency, and other quality attributes. As class sizes increase, however, it is hard to provide detailed feedback consistently, especially when multiple assessors are required to handle a larger number of assignment submissions. Large language models (LLMs), such as ChatGPT, offer a promising alternative to help automate this process in a consistent, scalable, and minimally-biased manner.

This paper explores ChatGPT-4's scalability and accuracy in assessing programming assignments based on predefined rubrics in the context of a case study we conducted in an upper-level undergraduate and graduate CS course at Vanderbilt University. In this case study, we employed a method that compared assessments generated by ChatGPT-4 against human graders to measure the accuracy, precision, and recall associated with identifying programming mistakes. Our results show that when ChatGPT-4 is used properly (e.g., with appropriate prompt engineering and feature selection) it can improve objectivity and grading efficiency, thereby acting as a complementary tool to human graders for advanced computer science graduate and undergraduate students.

CCS CONCEPTS

• Software and its engineering → Software maintenance tools; • Applied computing → Computer-assisted instruction.

KEYWORDS

ChatGPT, Education, Generative AI, Large Language Models, Prompt Engineering, Automated Grading

ACM Reference Format:
Skyler Grandel, Douglas C. Schmidt, and Kevin Leach. 2024. Applying Large Language Models to Enhance the Assessment of Parallel Functional Programming Assignments. In 2024 International Workshop on Large Language Models for Code (LLM4Code '24), April 20, 2024, Lisbon, Portugal. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3643795.3648375

1 INTRODUCTION

Conversational large language models (LLMs), such as ChatGPT-4 [13], have proven effective in a range of domains, including code generation and analysis [25]. LLMs are particularly promising in domains where humans and AI tools collaborate to more rapidly and reliably solve software problems [5, 24]. These advances have enabled the application of LLMs in educational domains, particularly in disciplines that benefit from automated and/or assisted analysis of textual or programming content [14, 19, 20].

Motivating the need for more effective and scalable CS program assessment tools. In the context of computer science (CS) education, assessing programming assignments is a task that traditionally requires a considerable expenditure of time and effort from instructors and/or graders. Moreover, the quality of the grading process is often susceptible to human error and subjectivity [15], particularly as class size grows. To mitigate such errors and accelerate the grading process, this paper explores the application of ChatGPT-4 to perform more objective and efficient analysis and assessment of student programs. This approach is increasingly relevant as enrollments in CS classes increase, which often necessitates the use of multiple graders whose inconsistencies (known as the "inter-rater reliability problem") can pose substantial challenges for fair and reliable grading [7, 15].

To cope with the impact of scale in large CS classes, programming assignments are often assessed via automated graders, which are similar to unit and/or integration test suites. While automated graders are useful in helping to assess functional correctness, they offer limited aid in judging programming style, efficiency, and other quality attributes [4, 6, 15]. Moreover, automated test suites may stifle student creativity by mandating overly restrictive structures to fit within the "Procrustean Bed" of auto-graders.

In the case of coding style, many instructors adopt "linters" to identify stylistic mistakes [10]. However, these tools are limited in their ability to capture certain elements of coding style, such as documentation completeness or holistic readability. In contrast, LLMs, such as ChatGPT-4 or Claude, offer a more flexible qualitative grading solution that can evaluate functionality, coding style, efficiency, and other quality attributes in a largely automated fashion.

More generally, the integration of generative AI tools into CS pedagogical practices can pave the way for more personalized and adaptive learning experiences through bespoke feedback that conventional unit and integration tests cannot provide [2, 4, 6, 15]. This paper thus presents the results of a case study conducted in an upper-level undergraduate and graduate course at Vanderbilt University entitled "Parallel Functional Programming in Java".¹

¹ All course material is available at www.dre.vanderbilt.edu/~schmidt/cs253 in open-source form.


This case study explores how the application of LLMs in grading and programming assignment assessment through a semi-automated grading methodology using ChatGPT-4 can supplement and enhance conventional automated graders and traditional manual grading.

Solution approach → The GreAIter LLM-based auto-grading tool. We rely on prompt engineering techniques [24] to converse effectively with ChatGPT-4. A prompt is a set of instructions provided to an LLM that programs the LLM by customizing it and/or enhancing or refining its capabilities, which can influence behavior of and interactions with LLMs [25]. Prompt engineering is the means by which LLMs are programmed via prompts, guided by experience [23, 26] on applying LLMs effectively. We applied these techniques to develop an LLM-based auto-grading tool (known as "GreAIter") that assists human graders in locating faults and generating accurate and meaningful feedback for students.

Our case study implemented GreAIter using ChatGPT-4 and applied it to evaluate programming assignments in the "Parallel Functional Programming in Java" course. We defined rubrics using a structured JSON format with specific grading criteria to communicate the grading methodology to ChatGPT-4. For each criterion contained in the rubric, ChatGPT-4 was instructed to
(1) output the code from each student submission that is relevant to that criterion and a grade of "correct" or "incorrect", and then
(2) compile a summary of all mistakes made by the student and output suggested feedback for the student based on their submission and the mistakes therein.
The results of this assessment were reviewed by human graders to produce the student's final grade, thereby yielding an efficient, accurate, and minimally-biased grade due to the collaboration between humans and the GreAIter generative AI tool.

Evaluation approach and research contributions. To evaluate our approach, we assessed ChatGPT-4's performance in grading programming assignments and compared this performance to that of human graders. We conducted this comparison by examining ChatGPT-4's accuracy and efficiency in evaluating student submissions given a rubric compared to a human grader. We investigated the false positive and false negative rates through precision and recall to determine ChatGPT-4's shortcomings as a grader and analyzed how it can be used in a semi-automated approach. We further examined the objectivity of results to assess ChatGPT-4's potential in reducing subjective bias in grading by comparing results from multiple grading attempts given the same submission.

This paper provides the following contributions to research on AI-assisted programming assignment evaluation:
• It provides empirical evidence regarding the utility of ChatGPT-4 as a tool for assistance in assessing programming assignments for advanced computer science topics, i.e., parallel functional programming.
• It offers insights into the extent to which LLMs like ChatGPT-4 can be relied upon for accurate assessment in educational settings, potentially setting the stage for broader adoption and further technological development.

Paper organization. The remainder of this paper is organized as follows: Section 2 describes the methodology of our work, encompassing the design of our GreAIter auto-grading tool and the prompting strategies we used to achieve our results; Section 3 explains the experiment we designed to assess the performance of our methodology in grading programming assignments in the "Parallel Functional Programming in Java" course; Section 4 evaluates the results of our GreAIter grading tool using ChatGPT-4; Section 5 explores the limitations and threats to validity of our work; Section 6 compares our research with related work on AI-assisted programming assignment evaluation; and Section 7 presents concluding remarks and future work.

2 METHODOLOGY

This section describes the methodology of our study, focusing on the design of our LLM-based GreAIter auto-grading tool and the prompt engineering strategies we applied to achieve our results.

2.1 Overview of GreAIter and our AI-assisted Grading Process

Figure 1 depicts the steps involved in our grading process using GreAIter. We begin with the student submission (1) and the rubric (2) being input into ChatGPT-4. This LLM then conducts an intermediate analysis (3), consisting of a detailed evaluation for each criterion in the rubric. ChatGPT-4 then summarizes its assessment and a human grader reviews the output (4), verifying and adjusting ChatGPT-4's evaluations as needed. The final output from the human grader (5) consists of a binary grade for each criterion in the rubric, indicating the performance of the student's programming assignment submission against each criterion.

[Figure 1: GreAIter's AI-Assisted Grading Process. Panels: (1) Student Submission, (2) Subtractive Rubric, (3) Intermediate Analysis, (4) Human Grader, (5) Binary Grade for Each Rubric Item.]

GreAIter provides a bridge between theoretical generative AI capabilities and practical educational applications. This tool enables instructors to use LLMs effectively to improve grader efficiency and objectivity by providing an automated system that interprets student submissions against a predefined rubric and produces an objective assessment mirroring human graders' processes. GreAIter leverages rubric-based evaluation, where each criterion is clearly defined via a structured format. This format provides ChatGPT-4 with the parameters needed to assess student submissions and ensure consistency across multiple evaluations, thereby addressing the inter-rater reliability problem [7].


While GreAIter contains no elements that are specific to any particular programming assignment, we recommend certain steps for integrating it into a class.² In particular, a reliable and structured rubric is necessary for effective results. ChatGPT-4 has been shown by others [27] to be an adequate prompt engineer on the level of humans, and we leverage this capability to generate rubrics that GreAIter uses to prompt ChatGPT-4. In particular, ChatGPT-4 can generate a usable rubric given (1) a list of potential mistakes, (2) an "answer key" (i.e., the desired assignment solution), and (3) the JSON structure shown in Figure 2. We follow this approach to generate the rubrics for our experiments, with researchers verifying each rubric criterion that ChatGPT-4 outputs to ensure the quality of its description and its correct and incorrect examples.

² GreAIter is available to instructors of CS programming courses on request.

Our initial round of experiments indicated that ChatGPT-4 was prone to generating code for the incorrect example in a rubric criterion that differs from the real mistakes students may make. For example, we might want ChatGPT-4 to ensure that students use Java method references rather than lambda expressions where possible for stylistic reasons. Here, the desired solution might look something like ".map(this::someFunction)", while ChatGPT-4 might generate something like ".map(item -> )" instead of ".map(item -> someFunction(item))", which is semantically equivalent to the correct answer. We therefore provided examples of incorrect code in prompts to generate these rubrics and verified the incorrect example to ensure it exhibited realistic mistaken behavior.
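To make this stylistic criterion concrete, the short program below is our own illustration (it is not drawn from the paper or the course assignments): it contrasts the method-reference form the criterion expects with the semantically equivalent lambda form a student might realistically submit.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MethodReferenceCriterion {
    static String normalize(String s) {
        return s.trim().toLowerCase();
    }

    public static void main(String[] args) {
        List<String> input = List.of("  Fork ", "Join", " Streams ");

        // Style the rubric criterion expects: a method reference.
        List<String> preferred = input.stream()
                .map(MethodReferenceCriterion::normalize)
                .collect(Collectors.toList());

        // Realistic student "mistake": a lambda that is semantically equivalent
        // to the method reference but violates the stylistic criterion.
        List<String> lambdaStyle = input.stream()
                .map(s -> normalize(s))
                .collect(Collectors.toList());

        System.out.println(preferred.equals(lambdaStyle)); // true: same behavior, different style
    }
}
```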
2.2 Prompt Engineering and Human-AI Collaboration

While GreAIter is capable of operating in a fully autonomous mode, we applied a semi-autonomous method due to limitations associated with current LLM technologies, including ChatGPT-4 [3, 23-25]. Despite its advanced capabilities, ChatGPT-4 can generate errors (commonly referred to as "hallucinations"), where it confidently asserts inaccurate or nonsensical information [3]. This tendency is problematic for educational assessments, where the stakes of incorrect evaluations are high, as they may significantly impact a student's learning trajectory and academic record.

Given a programming assignment and rubric, GreAIter generated feedback for human graders to review. As a final sanity check, human graders then checked the relevant segment(s) of student code identified by GreAIter to manually verify that the issues it flagged were indeed mistakes (rather than false positives). Human graders thus scored each student appropriately and reviewed GreAIter's feedback before returning results to students via GRADE files pushed to their GitLab repositories.

By integrating a human-in-the-loop approach, we introduced a crucial verification step. Human graders review the AI-generated assessments, ensuring the reliability of the final output. This safeguard is not merely a corrective measure; it also reinforces the educational value of the grading process. A human grader's oversight ensures that feedback is pedagogically appropriate and contextually relevant to each student's learning needs.

Our semi-automated approach also aligns with ethical guidelines, promoting responsible AI by mitigating risks associated with unverified autonomous AI operation in high-stakes application domains like primary and secondary education. Our approach respects the sophistication of the AI while prudently managing its limitations and balancing the efficiency of automation without foregoing the expertise of human educators. The result is a hybrid model that aims for high-quality, scalable assessment mechanisms that both educators and students can rely upon.

At the heart of GreAIter's functionality is prompt engineering, i.e., the intentional design of prompts that guide LLMs in performing their tasks. For our GreAIter process, prompts are carefully crafted to elicit specific behaviors from ChatGPT-4, enabling it to understand and apply the grading rubrics accurately. Due to ChatGPT-4's familiarity with JSON, we found that formatting the rubric using JSON (as shown in Figure 2) is an effective prompting strategy to ensure ChatGPT-4 accurately parses the information in the rubric. We therefore define our rubric as a JSON array where each element contains an object representing a rubric criterion. Each rubric criterion contains entries for the criterion's title, description, and a correct and incorrect example.

[Figure 2: Example JSON to Provide a Rubric to ChatGPT-4.]
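Since Figure 2 itself is not reproduced here, the sketch below shows what one element of such a rubric might look like based on the description above; the key names and the sample criterion are illustrative assumptions rather than the authors' exact schema.

```java
// Illustrative reconstruction of one rubric criterion in the JSON format the text
// describes (Figure 2 is not reproduced here). The key names and this sample
// criterion are our assumptions, not the authors' exact schema.
public class RubricSketch {
    static final String RUBRIC_JSON = """
            [
              {
                "title": "Prefer method references over lambda expressions",
                "description": "Where a lambda merely forwards its argument to an existing method, use a Java method reference instead.",
                "correct_example": ".map(this::someFunction)",
                "incorrect_example": ".map(item -> someFunction(item))"
              }
            ]
            """;

    public static void main(String[] args) {
        System.out.println(RUBRIC_JSON); // the array would hold one object per rubric criterion
    }
}
```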
We used the following prompt to instruct ChatGPT-4 on the use of this rubric:

    You are a grader for the parallel functional programming course taught in Java. I will give you a JSON rubric and student Java code. For each item in the rubric, you will first output the function in the student's code that is relevant to that item and then you will output a score of "correct" or "incorrect". Alternative answers to the correct code are permissible if they have the same functionality and do not apply poor style conventions.

This prompt was followed by the rubric and the student's code. A subsequent request instructed ChatGPT-4 to compile a comprehensive summary of errors or misalignments with the rubric's expectations, along with suggested feedback for the student based on their specific mistakes. This prompt forced ChatGPT-4 to consider each individual criterion in the rubric, and then used a chain-of-thought³ prompting strategy by asking the LLM to output the relevant code before making a judgement about its correctness. These strategies helped to minimize ChatGPT-4's tendency to hallucinate errors, skip rubric criteria, and/or consider irrelevant parts of student code.

³ Chain-of-thought prompting [23] instructs an LLM to explain its "thought process" before giving an answer to improve answer quality.
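The paper does not show how GreAIter issues these requests programmatically. The sketch below is therefore our assumption of how the two-step conversation described above (per-criterion grading, then a feedback summary) could be driven from Java against OpenAI's chat completions REST endpoint; the grader prompt is the one quoted above, while the rubric and student code are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of GreAIter's two-step conversation with ChatGPT-4 via the
// OpenAI chat completions endpoint. The surrounding structure is our assumption,
// not the authors' implementation.
public class GreAIterSketch {
    static final String GRADER_PROMPT =
            "You are a grader for the parallel functional programming course taught in Java. "
          + "I will give you a JSON rubric and student Java code. For each item in the rubric, "
          + "you will first output the function in the student's code that is relevant to that "
          + "item and then you will output a score of \"correct\" or \"incorrect\". Alternative "
          + "answers to the correct code are permissible if they have the same functionality "
          + "and do not apply poor style conventions.";

    public static void main(String[] args) throws Exception {
        String rubricJson = "[]";          // placeholder: the JSON rubric
        String studentCode = "class X {}"; // placeholder: the student's submission

        // Request 1: per-criterion grading (relevant code first, then a verdict),
        // the chain-of-thought ordering described in the text.
        String grading = ask(GRADER_PROMPT + "\n\nRubric:\n" + rubricJson
                + "\n\nStudent code:\n" + studentCode);

        // Request 2: a summary of all mistakes plus suggested feedback for the student.
        String feedback = ask("Given this grading output:\n" + grading
                + "\n\nSummarize all mistakes and suggest feedback for the student.");

        System.out.println(feedback);
    }

    static String ask(String userContent) throws Exception {
        String body = "{\"model\":\"gpt-4\",\"messages\":[{\"role\":\"user\",\"content\":\""
                + escape(userContent) + "\"}]}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // raw JSON; a real tool would extract the message content
    }

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

A real implementation would carry the first response forward as conversation history and parse the assistant message out of the returned JSON rather than printing the raw body.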


Based on our experience with the GreAIter case study, including both correct and incorrect examples in the rubric is crucial for several reasons. First, it enables a "few-shot learning"⁴ method that provides context and clarifies edge cases and potentially ambiguous instructions. Second, these examples aid ChatGPT-4 in providing specific feedback to students by comparing their solutions to the desired solution.

⁴ Few-shot learning involves training an AI model from only a few examples, which typically allows the model to perform better than it would in a 0-shot approach [22], and has proven effective for LLMs [12].

2.3 Assessment Process and Ethical Considerations

Our GreAIter-based assessment process began with ChatGPT-4 receiving each student's code and the associated rubric through our prompts, as shown by step (1) in Figure 1. ChatGPT-4 then systematically evaluated the code, criterion-by-criterion, referencing specific code segments as evidence for its assessments. Our prompts were designed to ensure that ChatGPT-4's evaluation was not merely keyword-based but contextually rooted in the logic and syntax that the rubric required.

Upon completion of the assessment for each criterion, ChatGPT-4 aggregated individual assessments into a final summary. This summary conveyed areas where the student excelled, as well as areas that require further improvement. In addition, this summary provided a foundational tool for human graders to either validate the results of GreAIter's grading process or to provide additional insights where necessary.

To maintain the integrity of the assessment, we also included a review mechanism where the outputs generated by GreAIter were cross-examined by human graders. This dual-layered approach not only fine-tuned the assessment process but also established a comprehensive feedback system that benefited the students' learning experience. Our goal was to harness the computational precision and scalability of ChatGPT-4 while retaining the nuanced judgment of humans, striving for an equilibrium that augments the grading process within Vanderbilt University and other educational environments.

3 EXPERIMENT DESIGN AND EVALUATION

To evaluate our methodology described in Section 2, we designed an experiment to empirically determine how well ChatGPT-4 performed in its assessment of student parallel functional programming assignments. The objective of this experiment was to assess the performance of a ChatGPT-4-based automated code assessor against human graders in terms of accuracy, efficiency, and objectivity. The experimental setup described in this section measured the efficacy of GreAIter by comparing its assessment outcomes to those of experienced human graders.

Assignments and student submissions for this experiment were obtained from our parallel functional programming course at Vanderbilt University, which consisted of 26 undergraduate and graduate students in the fall semester of 2023. We considered results from three assignments given to this class cohort and used final student grades on the assignments as the "ground-truth" human-graded benchmark. To eliminate inter-rater bias, one graduate teaching assistant (TA) grader with five years of experience with Java and two years of experience with Java parallel functional programming initially graded all submissions. These grades were then reviewed by the course instructor to ensure accuracy.

The TA grader was given the same rubrics as GreAIter, though these rubrics were formatted as plain text instead of JSON for readability. We recognized that a single TA grader reviewed by a single instructor might exhibit potential bias, so we investigated each inaccuracy carefully to determine the cause of potential flaws.⁵

⁵ We assumed that a three-way agreement between the TA grader, the instructor, and GreAIter is most likely accurate, so we do not investigate cases where such a consensus is reached.

Our experiment considered the following three research questions:

RQ1 (Performance): Can GreAIter perform correctly by identifying mistakes in student program submissions?
RQ2 (Efficiency): What is the reduction in the amount of manual grading that must be done when using GreAIter compared to traditional manual grading?
RQ3 (Consistency): How consistent is our LLM grading methodology across multiple grading attempts of the same programming assignments?

Section 4 below discusses our recommendations for integrating our GreAIter grading methodology as a semi-automated grader in CS classes. For this experiment, however, we evaluated GreAIter's performance in isolation to provide evidence for our recommendation. While GreAIter could be used to fully automate grading, we use our experiments to determine GreAIter's failure modes to evaluate its efficacy and determine how an AI-assisted grader can verify results.

3.1 RQ1: Performance

Building upon the experimental design described above, we used three performance metrics to evaluate GreAIter rigorously. First, we used GreAIter's accuracy, which we quantified as the percentage of student mistakes correctly identified by GreAIter in alignment with the consensus grades established by the TA grader and instructor. High accuracy results would validate GreAIter as a reliable evaluator of code quality and correctness, thereby motivating its integration into the grading process to reduce the grading load on instructors while maintaining high assessment standards.

Second, we used GreAIter's precision, which we quantified in terms of ChatGPT-4's tendency to incorrectly mark a correct code segment as erroneous. This metric is crucial because high precision indicates GreAIter rarely marks correct code as erroneous, preventing undue penalties on students and reducing the need for human oversight. High precision indicates GreAIter's meticulousness, ensuring its feedback is constructive and based on actual student errors, thereby maintaining student trust in this assessment process. While poor precision would impact ChatGPT-4's ability to operate in isolation, it could be mitigated with additional TA grader intervention as part of the overall GreAIter assessment process.

Third, we used recall, which we quantified as GreAIter's ability to identify all incorrect code present. A high recall rate indicates GreAIter can effectively detect most—if not all—errors in student submissions, which is critical because identifying candidate mistakes demonstrates ChatGPT-4's ability to assist human graders. In contrast, a low recall rate would require human oversight to a degree that GreAIter would not substantially accelerate the assessment process, particularly at scale as CS class sizes increase. If GreAIter exhibited high recall, its utility in providing the comprehensive and thorough feedback necessary for educational purposes would be enhanced, i.e., it could consistently identify mistakes (which can then be verified quickly by human graders).


Table 1: Confusion Matrix of LLM-Produced Grades (AI Grader Results Compared to Human Graders).

          True      False
    +     2.61%     1.79%
    −     95.60%    0%

A summary of intermediate statistics is shown in Table 1, which depicts positives (+) and negatives (−) found by GreAIter and shows promising statistics, with a potential failure mode being present in false positives, to which GreAIter is prone. The accuracy of GreAIter was benchmarked against the grades determined by the TA grader and validated by the course instructor, which yielded the following results:

• True Positives (TP): 2.61% - The percentage of instances where GreAIter correctly identified errors that were also recognized by the TA grader.
• True Negatives (TN): 95.60% - The percentage of instances where GreAIter correctly identified correct code segments, aligning with the TA grader's assessments.
• Overall Accuracy: 98.21% - The proportion of correct assessments made out of all grading decisions.

The precision and recall of GreAIter reflected its ability to minimize false positives and false negatives, both crucial aspects of ensuring fair and constructive feedback:

• False Positives: 1.79% - The percentage of instances where GreAIter marked correct code segments as erroneous.
• False Negatives: 0% - The percentage of instances where GreAIter marked erroneous code segments as correct.
• Precision: 59.30% - GreAIter's ability to correctly identify student mistakes without over-penalizing correct aspects of their submissions.
• Recall: 100% - The comprehensiveness of GreAIter in detecting errors present in the student submissions.
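The reported precision, recall, and accuracy follow directly from these confusion-matrix percentages. The snippet below is a quick check of that arithmetic rather than part of GreAIter itself.

```java
// A small sketch (not from the paper) showing how the reported metrics follow from
// the confusion-matrix percentages in Table 1; small differences in the last decimal
// place come from the percentages themselves being rounded.
public class GradingMetrics {
    public static void main(String[] args) {
        double tp = 2.61, fp = 1.79, tn = 95.60, fn = 0.0; // percentages from Table 1

        double accuracy  = (tp + tn) / (tp + tn + fp + fn); // 0.9821 -> 98.21%
        double precision = tp / (tp + fp);                  // ~0.593 -> the reported 59.30%
        double recall    = tp / (tp + fn);                  // 1.0    -> 100%

        System.out.printf("accuracy=%.2f%% precision=%.2f%% recall=%.2f%%%n",
                100 * accuracy, 100 * precision, 100 * recall);
    }
}
```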
The results of applying our semi-automated GreAIter tool revealed a nuanced performance profile across the metrics presented above. Our high overall accuracy rate indicates that GreAIter aligned well with the TA grader's performance in the vast majority of cases. This finding suggests a strong foundational reliability of ChatGPT-4 in evaluating parallel functional programming assignments.

Interestingly, the seemingly low true positive rate of 2.61% indicates we had somewhat skewed data, with considerably more true negatives than true positives. This result is not unexpected since student grades on programming assignments in this course tend to average over 90%. However, it does render our true positive rate somewhat meaningless.

A precision of 59.30% implies that when GreAIter identifies an error, it is correct just over half of the time, indicating a tendency towards false positives, where GreAIter incorrectly marks code as erroneous. While GreAIter is thorough, therefore, it may be overly critical or prone to hallucination, i.e., perceiving errors that are not there. More optimistically, GreAIter exhibited perfect recall in this trial, meaning it never missed a mistake made by a student.

Overall, these results suggest that while GreAIter shows promise in terms of high accuracy and recall, it should be managed carefully due to its propensity for false positives. The strong recall indicates that GreAIter can serve as an effective initial filter in identifying potential errors in student submissions. However, the precision underscores the necessity of human oversight to confirm ChatGPT-4's findings and to provide the final judgment on the student's work, which substantiates our focus on a semi-automated grading approach.

3.2 RQ2: Efficiency

To address the second research question, we focused on evaluating the efficiency of our LLM-based grading system compared to traditional manual grading methods. In this context, we defined efficiency by (1) the time investment required for grading, (2) the number of rubric criteria a grader must assess, and (3) the volume of code that must be reviewed for each submission.

3.2.1 Time Investment. Based on our experience applying GreAIter throughout the fall semester of 2023, we found that the semi-automated process for grading was notably faster than manual grading. In particular, we observed that our grader's runtime averaged roughly one minute per student submission due to the request latency of the OpenAI API. However, GreAIter could simultaneously assess all submissions for a given assignment, so it could run as a background process while the TA grader reviewed the results. The runtime of GreAIter thus had a negligible effect on overall grading efficiency.

The time needed to grade each submission is a critical measure of efficiency. We tracked the duration it took for a human grader and our semi-automated GreAIter approach to complete the grading process for all student submissions in a given assignment. Our observations indicated that GreAIter substantially reduced the time required per submission. In particular, the average time taken by GreAIter to assess all submissions for a single assignment was roughly 45 minutes, or 1.73 minutes per student submission.

In contrast, the TA grader spent an average of just under 4 hours to grade all submissions for an assignment, or 9.23 minutes per submission. We therefore found that our approach reduced overall grading time by approximately 81.2%, which highlights the potential of GreAIter to enhance grading efficiency in educational settings. This increased efficiency would be even more apparent in much larger CS classes that require multiple TA graders, and would also address the inter-rater reliability problems that would likely arise with multiple TA graders.


3.2.2 Number of Rubric Criteria. Another dimension of efficiency we investigated was the number of rubric criteria that a grader must check. In the traditional manual method, a TA grader must assess each criterion for every submission. In contrast, GreAIter's ability to quickly identify correct code segments reduced the number of criteria requiring detailed review. We quantified the average number of rubric criteria the TA grader had to assess in depth for each submission compared to those GreAIter flagged for further review. Our findings showed that GreAIter required attention to roughly 1.1 rubric criteria per student submission on average, which was substantially less than the average of 25 criteria for the TA grader. Overall, this constituted a substantial decrease of 95.6%.

3.2.3 Volume of Code. Finally, we evaluated efficiency in terms of the total number of lines of code a grader needed to check to grade a submission. GreAIter's ability to precisely target relevant code segments for each rubric criterion reduced the overall volume of code that needs in-depth review. This property of GreAIter resulted from its ability to find the method relevant to each rubric criterion within student code. GreAIter thus only needed to check methods that ChatGPT-4 highlighted as relevant to a rubric criterion that a student missed.

We compared the average number of lines of code reviewed per submission by GreAIter and the TA grader. GreAIter reviewed an average of 28 lines of code per student submission. In contrast, the TA grader reviewed an average of 409 lines of code, constituting a 93.2% decrease in volume.

The results from our experiments indicate a substantial improvement in grading efficiency when using ChatGPT-4 as the LLM for the GreAIter process. The reduction in time per submission—coupled with fewer rubric criteria requiring in-depth review and a lower volume of code to scrutinize—significantly reduced our manual grading workload. This efficiency does not come at the cost of grading quality since GreAIter still adhered to the grading standards we established. By freeing up time and resources, moreover, we could focus more on providing quality feedback, engaging in interactive teaching, and developing better course content, thereby enriching the overall student educational experience.
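As a quick check of the reductions reported in this section, the snippet below recomputes them from the per-submission averages given above; it is our own arithmetic, not code from GreAIter.

```java
// A quick sketch (ours, not the authors') reproducing the Section 3.2 effort
// reductions from the reported per-submission averages. The paper's 81.2% time
// reduction was presumably computed from unrounded timings, so the first figure
// differs slightly in the last decimal place.
public class EffortReduction {
    static double reductionPercent(double withGreAIter, double manual) {
        return 100.0 * (1.0 - withGreAIter / manual);
    }

    public static void main(String[] args) {
        System.out.printf("time:     %.1f%%%n", reductionPercent(1.73, 9.23)); // ~81.3%
        System.out.printf("criteria: %.1f%%%n", reductionPercent(1.1, 25));    // 95.6%
        System.out.printf("code:     %.1f%%%n", reductionPercent(28, 409));    // ~93.2%
    }
}
```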
3.3 RQ3: Consistency

To assess the reliability of GreAIter, we implemented a repeatability test, which provided a crucial measure of our grader's consistency over time. In this context, consistency refers to the ability of GreAIter to produce the same results when presented with the same inputs under similar conditions. This ability is vital for its potential deployment in educational settings and ensures our tool's accuracy results are not the result of a chance response from ChatGPT-4.

The primary metric for success in the repeatability test is the consistency rate, i.e., the percentage of identified mistakes that remain unchanged across multiple grading attempts by GreAIter. A high consistency rate indicates that GreAIter is stable and reliable in both its grading and feedback and exhibits minimal inter-rater reliability bias, which is essential for any (semi-)automated grading tool used in academia.

The consistency rate we observed was 78%, which was the ratio of rubric criteria for which GreAIter gave the same result to the same student submission over two grading attempts. However, all identified disagreements were false positives, i.e., in one trial GreAIter hallucinated a problem while it did not hallucinate, or hallucinated a different problem, in the other trial. This finding highlights the stability of GreAIter and the minimization of grader bias.
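The snippet below sketches this definition of the consistency rate (matching verdicts divided by total verdicts compared across two runs); it is our illustration rather than the authors' measurement script.

```java
import java.util.List;

// Minimal sketch (not the authors' code) of the consistency-rate definition above:
// the fraction of rubric-criterion verdicts that are identical across two grading
// runs of the same submission.
public class ConsistencyRate {
    static double consistencyRate(List<String> runA, List<String> runB) {
        int matches = 0;
        for (int i = 0; i < runA.size(); i++) {
            if (runA.get(i).equals(runB.get(i))) {
                matches++;
            }
        }
        return (double) matches / runA.size();
    }

    public static void main(String[] args) {
        // Toy data: verdicts for four rubric criteria from two independent runs.
        List<String> first  = List.of("correct", "incorrect", "correct", "correct");
        List<String> second = List.of("correct", "correct",   "correct", "correct");
        System.out.println(consistencyRate(first, second)); // 0.75
    }
}
```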


This consistency rate informs us about the repeatability of GreAIter's performance. Although this rate is not perfect, it is substantial and indicates that GreAIter can reliably reproduce its grading decisions across multiple iterations. The inconsistencies we observed were due to false positives, reinforcing our earlier observation regarding GreAIter's tendency to over-diagnose errors. The nature of GreAIter's inconsistencies is thus consistent with our previous findings that underscore the necessity of human oversight to confirm GreAIter's findings and to provide the final judgment on the student programming submissions.

This semi-autonomous "augmented intelligence" approach (i.e., where GreAIter provides a first pass and humans verify) offers a balanced solution, combining the thoroughness and speed of LLMs like ChatGPT-4 with the discernment and expertise of human graders. This hybrid strategy helped us streamline the grading process, reduce the workload for instructors and TAs, and maintain the integrity and fairness expected of academic evaluations.

4 ANALYSIS OF RESULTS

This section analyzes the results of our semi-automated GreAIter tool, focusing on the implications of its performance metrics, its potential role and integration within educational settings, and considerations for its future application. A summary of our experimental results using the TA grader as the ground truth is presented in Table 2.

Table 2: Summary of Results.

    Performance         Accuracy: 98.21%    Precision: 59.30%       Recall: 100%
    Effort Reduction    Time: 81.2%         Rubric Criteria: 95.6%  Code Volume: 93.2%
    Consistency         Consistency Rate: 78%

Performance metrics and efficiency gains. GreAIter's high overall accuracy (98.21%) and recall (100%) indicate its potential as a useful tool and process in programming assignment assessment and grading. Its effectiveness in correctly identifying correct code submissions significantly reduced TA grader workload by filtering out submissions that likely required no further review. However, the precision of 59.30% raises concerns regarding its number of false positives. This result reflects the current limitations of ChatGPT-4, which, while sophisticated, can still misinterpret complex instructions or code nuances, leading to the incorrect identification of errors.

This outcome led us to apply a semi-automated grading and feedback approach, which leverages GreAIter to find candidate mistakes in student code for subsequent TA grader review. Since GreAIter outputs the relevant code for each mistake, TA graders can quickly validate candidates. This semi-automated process can accelerate the grading process and reduce TA grader effort, as shown by our efficiency study in Section 3.2, while maintaining human intervention to mitigate student distrust of AI-based systems.

Repeatability and semi-automation. The repeatability rate of 78% suggests that while GreAIter is generally reliable, there are some variations in its grading across iterations. This variability is particularly problematic in the context of CS courses, where consistency in grading programming assignments is paramount to the fairness and credibility of the assessment process. However, because the disagreement between trials stems entirely from false positives, we can use GreAIter to focus TA grader attention on reducing bias and improving grading efficiency by decreasing grading time by up to 81.2%.

These results also indicate that GreAIter can contribute effectively to the grading process, primarily through initial assessments and identification of clear-cut cases of correct code. Its high accuracy in these instances could enable instructors to allocate more time to providing in-depth feedback where it is most needed, potentially enhancing the educational experience for students. Nevertheless, GreAIter's propensity for false positives necessitates a semi-automated approach where human graders perform a secondary review of its grading decisions. This approach leverages the strengths of GreAIter in rapidly processing and evaluating submissions while mitigating its weaknesses through human oversight.

5 LIMITATIONS AND THREATS TO VALIDITY

This section explores the limitations of the GreAIter AI-assisted grader described in Section 2 and the threats to validity of our experiments described in Section 3.

Generalizability across languages and paradigms. Although our study is extensive, it is not without limitations. In particular, we developed and evaluated GreAIter within a single course (parallel functional programming) and programming language (Java), which may limit the generalizability of our findings. For example, Java programming, and more specifically the parallel functional programming paradigm, has some unique challenges and patterns that may not be representative of other programming languages or paradigms.

Subjectivity in ground truth and precision concerns. We compared GreAIter's performance to that of a single TA grader and course instructor, which may introduce a degree of subjectivity into the "ground truth" against which GreAIter's performance was measured. Although the TA grader was quite experienced—and all grades were reviewed for accuracy—different graders may have different thresholds for correctness and error severity, potentially influencing the benchmarks used for AI evaluation. It may therefore be prudent to verify our results with multiple graders and courses, along with a comparison to human inter-grader reliability.

The false positives reported in GreAIter's outcomes reflect another limitation. While its recall was perfect (indicating it missed no errors), its precision was much lower, suggesting GreAIter sometimes identified errors too zealously. While this over-detection erred on the side of caution, it led to unnecessary reviews by human graders, diminishing the efficiency gains from using GreAIter.

Consistency and repeatability concerns. Another potential threat to the validity of the study is our reliance on the OpenAI API's latency, which may affect the grading speed results. Likewise, the consistency measure assumes that ChatGPT-based LLMs will not learn or adapt over time, which may not hold true as their underlying models are continuously updated and improved upon. However, the importance of this metric is in verifying the consistency of grades within a single cohort for a single assignment, and updates to the base model should not vary within a single assignment. Furthermore, the repeatability test, while designed to be rigorous, was confined to two trials. More extensive testing over additional trials could provide a deeper understanding of GreAIter's consistency and reliability.

Despite these limitations and threats to validity, our study provides valuable insights into the capabilities and limitations of using ChatGPT-4 for assessing programming assignments. The high accuracy and recall offer evidence of GreAIter's potential utility in CS courses, and its consistency rate is promising, though imperfect. The careful design of our study, the systematic approach to data collection and analysis, and the critical evaluation of results all contribute to the robustness of our findings. These limitations provide a clear framework for understanding the context within which the findings are applicable.

6 RELATED WORK

LLMs and AI-assisted education are active areas of research that we build upon by applying prompt engineering techniques to LLMs to facilitate automated assessments of student programming assignments. This section compares our research with related work in the fields of prompt engineering and AI-assisted education.

Several studies have made use of prompt engineering to improve the performance of LLMs, from simple prompt strategies [1] to more complex ones [24, 26]. Wei et al. [23] have investigated "chain-of-thought" prompting, which is an approach we apply in our work. Yao et al. have improved on this work by including action plan generation and external source lookup [26]. White et al. [24] developed prompt patterns, analogous to software patterns, that can improve and structure LLM outputs, and they have followed this with a study on prompt patterns for improving code quality in particular [25]. Common failure modes for LLMs have been identified as well, necessitating the development of improvements and mitigations [3].

Suggestions for future use of LLMs guide our study as well. We strive to follow guidelines put forth by van Dis et al. [21], particularly the suggestions to "embrace the benefits of AI" and "hold on to human verification". We leverage insights from this and other related work to develop a novel programming assignment assessment methodology to speed up grading and mitigate inter-grader bias.

Other researchers have explored applications of AI in education specifically. Most similar to our work is a study on automated grading of short answer questions using LLMs [19], which also found that an AI grader necessitates human oversight, though it is still helpful for maximizing grading efficiency. LLMs were applied to short answer grading as a follow-up [18] to a study on the same task using fine-tuned transformer models, which outperformed other automated attempts but did not achieve sufficient accuracy for full integration. Several studies [2, 9, 16, 20] have speculated on the benefits and pitfalls of LLMs and AI in education to guide future research, such as our study.

Detection of AI-generated submissions for education [14], generation of programming exercises and code explanations [17], and assistance in medical education using LLMs [11] have also been studied. This related work showcases the potential of LLMs for education, which helped guide our study. Finally, there is a long tradition of automated grading of programming assignments using test suites and linting for correctness and style [6, 8, 15]. We build on this related work by using ChatGPT-4 to overcome limitations in qualitative assessment, thereby enabling automated assessment of efficient implementations, adequate documentation, and a broader range of stylistic issues.


Our study builds upon the rich and growing body of research in the realms of prompt engineering, AI-assisted education, and automated grading systems. We acknowledge the various methods and applications explored in these fields, ranging from improving LLM outputs through intricate prompt designs to leveraging AI for educational purposes. Our approach contributes to this evolving landscape by applying sophisticated prompt engineering techniques to GPT-4 for the specific task of assessing programming assignments. This novel application not only addresses the challenges of efficient and minimally-biased grading but also encapsulates the potential of AI in enhancing the educational experience.

7 CONCLUDING REMARKS

This paper presented the results of our study that applied ChatGPT-4 to create GreAIter, which is an AI-assisted tool that helps automate key portions of the grading process for programming assignments in an advanced parallel functional programming course offered in Java. Our findings codify the potential and current limitations of AI-assisted grading systems and yielded the following lessons learned:

• ChatGPT-4 has the capacity to accurately identify correct code submissions. We demonstrate that GreAIter could achieve a high accuracy rate (98.21%) and perfect recall. These results suggest that LLMs can play an important role in assisting with grading tasks, particularly in filtering out rubric criteria that are likely correct for a given submission, thereby reducing human grader workload. Ironically, when ChatGPT-4 did not accurately identify correct code submissions, we interacted with it and got it to explain how we could craft future prompts to elicit more accurate results from it. The ability to engage in such a "Socratic dialogue" with an LLM like ChatGPT-4 was quite refreshing compared with traditional means of refining queries with conventional static analysis tools.

• The need for human oversight remains critical. The precision of 59.3%, marked by a substantial rate of false positives, points to the limitations of the current state of LLMs in understanding and evaluating complex programming tasks. Therefore, despite GreAIter's impressive recall rate (indicating no missed errors), human oversight remains necessary due to ChatGPT-4's tendency to over-flag student code segments as erroneous. By integrating insights from previous work with GreAIter, we extend the capabilities of LLMs in educational contexts and set the stage for future research in AI-assisted education. The synergy of LLMs and human expertise demonstrated in this study showcases the potential of LLMs in enhancing educational methodologies and outcomes.

• Due to ChatGPT-4's limitations, we stress the benefits of a semi-automated grading approach. A key lesson learned through our study is the importance of human-AI collaboration, which is commonly known as "augmented intelligence" rather than conventional "artificial intelligence". While GreAIter is powerful, it is not yet capable of replacing human judgment in tasks that require nuanced understanding. The semi-automated approach advocated by our research—where ChatGPT-4 performed an initial assessment and humans provide final verification and feedback—strikes a balance that leverages the strengths of both. We found this semi-automated approach reduced grading workloads, constituting a 93.2% decrease in code volume to review and an 81.2% decrease in grading time. We also found GreAIter yielded a consistency rate of 78%, thereby indicating that while LLMs can exhibit consistent, relatively unbiased tendencies, their performance can vary and thus should be regularly checked for consistency and accuracy.

• GreAIter can be integrated into actual classroom settings to improve grader efficiency and reliability. Through careful planning and systematic analysis of the accuracy, precision, recall, efficiency, and repeatability metrics covered in this paper, we showed that GreAIter improves traditional TA grading. We validated the feasibility of integrating GreAIter into an actual classroom setting, optimizing the grading process in terms of both efficiency and scalability. While GreAIter demonstrates a high degree of accuracy, its current limitations underscore the need for a semi-automated approach that combines the speed and consistency of LLMs with the critical thinking and expertise of human graders. As LLMs evolve, so too will the strategies for integrating them more effectively to enhance the quality and fairness of the grading process for CS courses.

Overall, our case study shows that the promise of LLMs in education extends beyond grading efficiency, i.e., LLMs have the potential to reshape how feedback is delivered, how learning is assessed, and how education is ultimately conducted. While LLMs have not yet reached the point of replacing human graders, they provide an important resource to aid educators, particularly in disciplines like CS that are characterized by ever-growing class sizes. In keeping with previous research [24, 25], we find that collaboration between human users and AI tools results in rapid and reliable software solutions, and we leverage this collaboration for a more efficient and effective educational process. As LLMs grow more sophisticated, we anticipate further research to refine and harness these powerful tools for the betterment of educational systems.

Looking forward, the integration of LLMs into grading CS programming assignments requires consideration of the trade-offs between efficiency and accuracy. The false positive rate must be reduced to make tools like GreAIter more autonomous and trustworthy. Our future work will focus on fine-tuning LLMs on larger and more diverse datasets of code submissions and rubrics, potentially improving their understanding and reducing the rate of false positives. Likewise, we plan to explore the use of ensemble methods, which combine multiple LLMs and/or AI-assisted graders to cross-verify results and improve grading consistency. Many false positives result from the same rubric criteria, so investigation into prompting strategies for specific rubric criteria is also likely to improve precision.

The limitations described in Section 5 also offer directions for future research. For example, our future work will explore a wide range of courses, broader coverage of programming languages, more diverse grading benchmarks, as well as other LLMs beyond ChatGPT-4. By understanding the specific contexts in which GreAIter performs well, and those in which it does not, we can better tailor this type of AI-assisted grader to meet instructor needs. Thus, while our case study is bound by certain limitations and potential threats to validity, our methodology and GreAIter's overall solid performance in several key metrics support the validity of our findings. The study's design and the presented results provide a foundation upon which future work can build, contributing to the evolving field of LLMs in CS education and the development of more sophisticated, reliable, and efficient AI-assisted graders.


8 ACKNOWLEDGEMENTS

We applied ChatGPT-4 extensively to implement GreAIter and perform the automated assessment of student programming assignments described in this paper. We also applied ChatGPT-4 to check the spelling and grammar of this paper.

REFERENCES

[1] Simran Arora, Avanika Narayan, Mayee F Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441 (2022).
[2] David Baidoo-Anu and Leticia Owusu Ansah. 2023. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI 7, 1 (2023), 52–62.
[3] Ali Borji. 2023. A categorical archive of ChatGPT failures. arXiv preprint arXiv:2302.03494 (2023).
[4] Julio C Caiza and Jose M Del Alamo. 2013. Programming assignments automatic grading: review of tools and implementations. INTED2013 Proceedings (2013), 5691–5700.
[5] Anita Carleton, Mark Klein, John Robert, Erin Harper, Robert Cunningham, Dionisio de Niz, John Foreman, John Goodenough, James Herbsleb, Ipek Ozkaya, Douglas Schmidt, and Forrest Shull. 2021. Architecting the Future of Software Engineering: A National Agenda for Software Engineering Research Development. https://insights.sei.cmu.edu/library/architecting-the-future-of-software-engineering-a-national-agenda-for-software-engineering-research-development/ Accessed: 2023-Dec-7.
[6] Brenda Cheang, Andy Kurnia, Andrew Lim, and Wee-Chong Oon. 2003. On automated grading of programming assignments in an academic institution. Computers & Education 41, 2 (2003), 121–131.
[7] Binglin Chen, Sushmita Azad, Rajarshi Haldar, Matthew West, and Craig Zilles. 2020. A validated scoring rubric for explain-in-plain-English questions. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education. 563–569.
[8] Chase Geigle, ChengXiang Zhai, and Duncan C Ferguson. 2016. An exploration of automated grading of complex assignments. In Proceedings of the Third (2016) ACM Conference on Learning@Scale. 351–360.
[9] Wayne Holmes, Kaska Porayska-Pomsta, Ken Holstein, Emma Sutherland, Toby Baker, Simon Buckingham Shum, Olga C Santos, Mercedes T Rodrigo, Mutlu Cukurova, Ig Ibert Bittencourt, et al. 2021. Ethics of AI in education: Towards a community-wide framework. International Journal of Artificial Intelligence in Education (2021), 1–23.
[10] Stephen C Johnson. 1977. Lint, a C program checker. Bell Telephone Laboratories, Murray Hill.
[11] Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health 2, 2 (2023), e0000198.
[12] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842 (2023).
[13] OpenAI. [n. d.]. ChatGPT. https://chat.openai.com/
[14] Michael Sheinman Orenstrakh, Oscar Karnalim, Carlos Anibal Suarez, and Michael Liut. 2023. Detecting LLM-generated text in computing education: A comparative study for ChatGPT cases. arXiv preprint arXiv:2307.07411 (2023).
[15] James Perretta, Westley Weimer, and Andrew DeOrio. 2019. Human vs. automated coding style grading in computing education. In 2019 ASEE Annual Conference & Exposition.
[16] Junaid Qadir. 2023. Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education. In 2023 IEEE Global Engineering Education Conference (EDUCON). IEEE, 1–9.
[17] Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1. 27–43.
[18] Johannes Schneider, Robin Richner, and Micha Riser. 2023. Towards trustworthy autograding of short, multi-lingual, multi-type answers. International Journal of Artificial Intelligence in Education 33, 1 (2023), 88–118.
[19] Johannes Schneider, Bernd Schenk, Christina Niklaus, and Michaelis Vlachos. 2023. Towards LLM-based Autograding for Short Textual Answers. arXiv preprint arXiv:2309.11508 (2023).
[20] Kehui Tan, Tianqi Pang, and Chenyou Fan. 2023. Towards Applying Powerful Large AI Models in Classroom Teaching: Opportunities, Challenges and Prospects. arXiv preprint arXiv:2305.03433 (2023).
[21] Eva A. M. van Dis, Johan Bollen, Willem Zuidema, Robert van Rooij, and Claudi L. Bockting. 2023. ChatGPT: five priorities for research. Nature 614, 7947 (2023), 224–226.
[22] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR) 53, 3 (2020), 1–34.
[23] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL].
[24] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382 (2023).
[25] Jules White, Sam Hays, Quchen Fu, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. ChatGPT Prompt Patterns for Improving Code Quality, Refactoring, Requirements Elicitation, and Software Design. arXiv:2303.07839 [cs.SE].
[26] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL].
[27] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022).

Received 7 December 2023; accepted 15 January 2024

