Effects of generative artificial intelligence on cognitive effort and task performance: study protocol for a randomized controlled experiment among college students
Research Article
Keywords: Generative artificial intelligence, randomized controlled trial, human cognition, cognitive
effort, creativity, analytical writing, eye-tracking, functional near-infrared spectroscopy
DOI: https://doi.org/10.21203/rs.3.rs-5557709/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Background: The advancement of generative artificial intelligence (AI) has shown great potential to enhance productivity in many cognitive tasks. However, concerns have been raised that the use of generative AI may undermine human cognition through over-reliance. Conversely, others argue that generative AI holds the promise to augment human cognition by automating menial tasks and offering insights that extend one’s cognitive abilities. To better understand the role of generative AI in human cognition, we study how college students use a generative AI tool to support their analytical writing. We will examine the effect of using generative AI on cognitive effort, a major aspect of human cognition that reflects the extent of mental resources an individual allocates during the cognitive process. We will also examine the effect on writing performance achieved through human-AI collaboration.
Methods: This study is a randomized controlled lab experiment that compares the effects of using
generative AI (intervention group) versus not using it (control group) on human cognition and writing
performance in an analytical writing task designed as a hypothetical writing class assignment for
college students. During the experiment, eye-tracking technology will monitor eye movements and pupil
dilation. Functional near-infrared spectroscopy (fNIRS) will collect brain hemodynamic responses. A
survey will measure individuals’ perceptions of the writing task and their attitudes toward generative AI. We
will recruit 160 participants (aged 18-35 years) from a German university where the research will be
conducted.
Discussion: This trial aims to establish the causal effects of generative AI on human cognition and task performance through a randomized controlled experiment. The findings will offer insights for policymakers regulating generative AI and inform the responsible design and use of generative AI tools.
Administrative information
Note: the numbers in curly brackets in this protocol refer to SPIRIT checklist item numbers. The order of
the items has been modified to group similar items (see http://www.equator-network.org/reporting-guidelines/spirit-2013-statement-defining-standard-protocol-items-for-clinical-trials/).
Title {1}: Effects of generative artificial intelligence on cognitive effort and task performance: study protocol for a randomized controlled experiment among college students
Author details {5a}: Youjie Chen, Department of Information Science, Cornell University, USA; Bin Lu, Chinese Academy of Medical Sciences and Peking Union Medical College, China
Introduction
Background and rationale {6a}
Recent advances in generative artificial intelligence (AI) have raised heated debates regarding its use in
performing cognitive tasks. Human collaboration with generative AI tools, such as OpenAI’s ChatGPT,
has been shown to enhance productivity across a wide range of cognitive tasks, including professional
writing tasks among white-collar workers [1], customer support services [2], knowledge-intensive
consulting [3], creative story-writing [4], and creative ideation [5]. However, concerns have been raised
that heavy use of these tools may lead to the erosion of human cognition [6, 7], which has important
implications for human cognitive health [8].
Many prior technological innovations have raised similar concerns about potentially causing a
deleterious effect on human cognition and cognitive health. For example, the use of calculators may
hinder arithmetic literacy, the use of search engines may reduce aspects of memory skills [9], and the
use of social media may contribute to everyday cognitive lapses [10]. According to these concerns,
access to these tools may allow individuals to bypass effortful tasks and thus reduce opportunities to
engage in the mental practice required for cognitive abilities to fully develop in the human brain [11, 12].
However, technologies could also be seen as an extension of human cognition, or the so-called
“extended mind” [13]. With appropriate cognitive offloading, technologies can extend the limits of human
cognition and become an active component of human brain mechanisms [14, 15]. For example, the use
of calculators can help individuals circumvent tedious arithmetic calculations and focus on complex
mathematical problems. The use of search engines can stimulate learning by broadening individuals’
knowledge space and providing tools for self-regulated learning [16]. In the end, the effect of technology
tools on human cognition is a nuanced problem that depends on the cognitive task, the tool itself, and
how it is used.
The emergence of generative AI tools has again raised heated debates about the effects of this new technology on human cognition, owing to its significant advancements over its antecedents [17]. These
advancements include the following: First, unlike traditional tools that assist with basic skills, such as
calculators, generative AI exhibits a higher level of intelligence to create ideas and construct arguments.
Second, generative AI encompasses a broad range of cognitive skills rather than a well-defined single
one. Consequently, it is difficult to pinpoint which cognitive skills generative AI may affect. Third,
generative AI is continuously developing at an ever-increasing speed, which complicates our ability to
predict the kinds of cognitive skills it may affect in the future. In consideration of all these advancements
and their implications, it is important to evaluate the effects of generative AI tool use on human
cognition.
Recent studies have begun to shed light on how generative AI may affect human cognition, mainly
through its effects on learning performance outcomes [18]. Randomized controlled trials have found that
students performed better when given access to general-purpose generative AI tools but performed
worse when these tools were taken away [19, 20]. This suggests that students may have relied on the
tool to bypass cognitive processes essential for developing cognitive skills, which ultimately
compromised their performance. A subsequent study found that generative AI boosted learning for those who used it to engage in deep conversations and explanations but hampered learning for those who sought direct answers [21]. This finding further highlights the difference between using generative AI as an active extension of human cognition and using it merely for passive cognitive offloading.
Studies so far have gained preliminary insights into generative AI’s effects on human cognition and performance through a standard assessment paradigm (SAP) [22]. In these experiments, participants were randomly assigned to either have or not have access to generative AI, and their skills were then tested through task performance in isolation from AI support [19–21]. However, this approach only
captures a static snapshot of the learning product but is insufficient to understand the ongoing
developmental process of learning during human-AI interaction [23]. To gain a deeper understanding of
generative AI’s effects on human cognition, it is important to develop measures during the interaction
process. In contrast to learning products, the learning process can indicate authentic progress over time
and reveal fundamental questions about how learning happens [24]. Additionally, the SAP tends to focus
on memory-based performance outcomes but may overlook activities for long-term cognitive
development that can be better evaluated through process-based behavioral measures [25, 26].
In light of this background, our study will evaluate the effects of generative AI on task performance and
cognitive effort during the interaction process. The task performance will reflect the overall achievement
of an individual executing a specified cognitive task in collaboration with generative AI. Cognitive effort
will reflect the extent to which the individual actively utilizes their mental resources while performing the
task. Exerting cognitive effort is fundamental to training one’s cognitive abilities and maintaining the
fitness of the human brain [11]. Measuring the amount of cognitive effort in a task that involves generative AI can reveal whether cognitive offloading preserves the effortful brain activity that is likely required to enhance one’s cognitive abilities. We will use state-of-the-art technology to evaluate psychophysiological proxies of cognitive effort throughout the task process in a lab-based randomized controlled trial (RCT). Specifically, we will use an eye tracker to measure pupil dilation changes and functional near-infrared spectroscopy (fNIRS) to measure cortical hemodynamic activity. Our study
context will focus on analytical writing among college students. We choose analytical writing because
this task requires high cognitive effort to develop critical thinking [27, 28], a fundamental higher-order
thinking skill crucial for problem-solving and decision-making [29, 30]. Despite the importance of critical thinking, it remains empirically unclear whether the use of generative AI has implications for the development of this skill.
Objectives {7}
1. To establish the effects of generative AI on human cognition and task performance, measured by
cognitive effort during the writing process and analytical writing performance (primary objective).
2. To explore the effects of generative AI on subjective perceptions of health- and learning-related
outcomes.
3. To investigate heterogeneous treatment effects across individuals with different characteristics.
Trial design {8}
Our study is a parallel randomized controlled trial (RCT) that compares the effects of using ChatGPT (intervention group) versus not using ChatGPT (control group) in an analytical writing task (Fig. 1). Participants will be randomized in a 1:1 ratio. The study follows an exploratory framework.
The study consists of three stages. In the first stage, the experimenter will onboard the participant and
ask the participant to sign a consent form. In the second stage, the participant will be invited into an
experiment room to sit in front of a computer with eye-tracking functionality that collects data on eye
movements and pupil size. An experimenter will assist the participant in wearing an fNIRS device that
collects data on brain hemodynamic responses. In the third stage, the actual experiment begins, and the
participant will independently follow the instructions displayed on the computer screen. The participant
will first take a pre-survey. Then they will be asked to read some learning materials on analytical writing
and then practice what they have learned by writing an analytical essay based on a writing prompt.
Participants in the intervention group can use ChatGPT to support their writing. Participants in the
control group will complete the task without AI assistance. After the writing task, the participant will
complete a post-survey. The entire study will last for approximately 1.5 hours for each participant.
This study will be a lab experiment conducted at Heidelberg University in Germany. Participants are
college students who will be recruited through social media platforms, email lists, and flyers. During the
preparation stage, the participant will be guided into an experiment room and instructed to sit in front of
a computer. For the main part of the experiment, the participant will independently follow instructions
displayed on a computer screen administered via an online survey platform (Qualtrics,
https://www.qualtrics.com/).
Eligible participants must be full-time college students, aged 18-35 years old, and have no self-reported
neurological or psychiatric disorders. Participants should be able to read English, as the entire
experiment will be conducted in English. To ensure a minimum level of computer literacy, participants should use a computer regularly, defined as on most days of the week. Additionally, participants
should not wear glasses or have any eye impairment (such as cranial nerve III palsy) to avoid issues with
eye-tracking data collection.
Before the experiment starts, the participant will be given an information sheet and a consent form by
the experimenter. The information sheet will explain the study’s aim, procedures, potential risks and
benefits, compensation, and contact information for the study investigators. The experimenter will
answer any questions that the participant may have before asking for consent. If the participant meets
the inclusion criteria and agrees to participate, they will be asked to sign the consent form, which the
experimenter will counter-sign. The participant will receive the information sheet and a copy of the
consent form. The other copy of the consent form will be retained by the research team. All participants
will be verbally informed that they can withdraw from the study at any time without giving any reason
and without having any negative consequences to their academic studies.
Additional consent provisions for collection and use of participant data and biological specimens {26b}
Interventions
In the control group, as in the intervention group, the computer screen will be set up in a split-screen
format. On the left side, the participant will receive the same instructions on how to write an analytical
essay via the Qualtrics platform. On the right side, instead of ChatGPT, a basic text editor interface will
be displayed. The instructions on the left side will explain to the participant that they can use the text
editor in any way they like to assist their writing. This comparator will keep the split-screen format
consistent between the two groups and ensure that participants in the control group can complete the
writing task with minimal support.
In the intervention group, the computer screen will be set up in a split-screen format. On the left side, the
participant will receive instructions on how to write an analytical essay via the Qualtrics platform. The
instructions will frame the essay as a homework assignment from a writing class that the participant
has taken. Specifically, the instructions will describe the writing requirements and the grading rubric. To
mimic homework assignments in typical educational settings, the instructions will also state that the
participant will receive their grades, the class average grade, and feedback from the instructor. The right
side of the screen will display a blank ChatGPT interface where the participant can prompt questions
and receive answers. The instructions on the left side will also explain to the participant that they can
use ChatGPT in any way they like to assist their writing, and that their writing score will not be penalized based on how ChatGPT is used.
This study is of minimal risk, and we do not anticipate needing to discontinue or modify the allocated interventions during the experiment. Participants can withdraw from the study at any time of their own volition. For those who withdraw partway through, compensation will be prorated based on the amount of time spent in the experiment.
Adherence to the interventions will be high because the procedures are straightforward and will be
clearly explained in the step-by-step instructions on the computer screen. The participant will be alone in
a noise-canceling room during the entire experiment. The participant can reach out to the experimenter
through an intercom if they need any clarification.
Outcomes {12}
The study has two primary outcomes. First, we will measure participants’ writing performance. The
analytical essay writing task is derived from the Analytical Writing section of the Graduate Record Examinations (GRE), a worldwide standardized computer-based exam developed by the Educational Testing Service (ETS) [27]. The participants’ writing performance will be scored from 0 to 6 on a grading rubric developed for the GRE, using e-rater, an automated essay-scoring system also developed by ETS [31]. We choose to use the ETS writing material for two reasons. First, their
writing task and grading rubrics are established materials designed to measure critical thinking and analytical writing skills, and they have been used in research as practice materials for writing [32].
Second, OpenAI’s technical report shows that ChatGPT (GPT-4) can score 4 out of 6 (~54th percentile)
on the GRE analytical writing task [33]. This gives us a benchmark for assessing the magnitude of the
performance increase when individuals collaborate with generative AI.
Second, we will measure participants’ cognitive effort during the writing process. Participants’ cognitive
effort will be measured using a psychophysiological proxy—i.e., changes in pupil size [34, 35]. Pupil
diameter and gaze data will be collected using the Tobii Pro Fusion eye-tracker at a sampling rate of 120
Hz. During the preparation stage of the study, the room light will be adjusted so that the illuminance
close to the participants’ eyes is held constant at 320 lux. Baseline pupil diameters will be
recorded during a resting task in the experiment preparation stage that asks the participant to stare at a
cross that will appear for 10 seconds each on the left, center, and right sections of the computer screen.
During the experiment, pupil diameters will be recorded throughout the writing process.
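For concreteness, the following sketch illustrates how the baseline correction could be computed in R. The data frame and column names (`resting`, `pupil`, `pupil_diameter_mm`) are our assumptions for illustration, as is the choice of subtractive rather than divisive correction; the actual pre-processing pipeline is described under Statistical methods.

```r
# Illustrative baseline correction for one participant's pupil data.
# Assumed inputs: `resting` (samples from the 30-second resting task) and
# `pupil` (samples from the writing task), both with a pupil_diameter_mm column.
baseline <- mean(resting$pupil_diameter_mm, na.rm = TRUE)

# Express task-phase samples as change from the participant's baseline
pupil$pupil_change <- pupil$pupil_diameter_mm - baseline
```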
The study has several secondary outcomes. First, to identify the neural substrates of cognitive effort during the writing process, we will use an additional psychophysiological proxy: changes in cortical hemodynamic activity in the frontal lobe of the brain. Specifically, we will examine hemodynamic
changes in oxyhaemoglobin (HbO) and deoxyhaemoglobin (HbR). Brain activity will be recorded
throughout the writing process using the NIRSport 2 fNIRS device and the Aurora software with a
predefined montage (Fig. 2). The montage consists of eight sources, eight detectors, and eight short-
distance detectors. The eighteen long-distance channels (source-detector distance of 30 mm) and eight
short-distance channels (source-detector distance of 8 mm) are located over the prefrontal cortex (PFC)
and supplementary motor area (SMA) (Fig. 2). The PFC is often involved in executive function (e.g.,
cognitive control, cognitive efforts, inhibition) [36, 37]. The SMA is associated with cognitive efforts [38,
39]. The sampling rate of the fNIRS is 10.2 Hz. Available fNIRS cap sizes are 54 cm, 56 cm, and 58 cm.
The cap size selected will always be rounded down to the nearest available size based on the
participant's head measurement. The cap is placed on the center of the participant’s head based on the
Cz point from the 10-20 system. The writing task can be characterized as a naturalistic study paradigm,
different from block- and event-related paradigms commonly adopted in neuroimaging studies [40, 41].
Second, unlike self-reported behavioral measurements taken only at the end of the task, the real-time psychophysiological data collected from the eye-tracker and the fNIRS will allow us to explore cognitive state dynamics throughout the writing process and during self-allocated sub-tasks, such as reading, writing, and prompting.
Third, we will measure participants’ subjective perceptions of the writing task by self-reported survey
measures in the post-survey (Table 1). We will measure participants’ subjective perceptions of the two
primary outcomes—that is, their self-perceived writing performance and self-perceived cognitive effort.
Self-perceived writing performance will be measured with a one-item scale using the same grading
rubric described in the instructions for their writing task and used in the scoring tool. Self-perceived
cognitive effort will be measured using a one-item scale adapted from the National Aeronautics and
Space Administration-task load index (NASA-TLX) [42, 43]. We will also measure participants’ subjective
perceptions of several mental health-related outcomes, including stress, challenge, and self-efficacy in
writing. Self-perceived stress and self-perceived challenge will each be measured using one-item sub-scales adapted from the Primary Appraisal Secondary Appraisal (PASA) scale [44, 45]. Self-
efficacy in writing will be measured using a sixteen-item scale that measures three dimensions of writing
self-efficacy: ideation, convention, and self-regulation [46]. Furthermore, we will measure participants’
situational interest in analytical writing. It will be measured in the post-survey using a four-item Likert
scale adapted from the situational interest scale [47]. Additionally, we will measure participants’
behavioral intention to use ChatGPT in the future for essay writing tasks [48].
Table 1. Scales in the post-survey

Construct | Items | Response
Self-perceived writing performance | Using the same grading rubric from before, what score do you think your essay should get (0 being the lowest and 6 being the highest)? | 0 to 6 scale
The time schedule is provided via the schematic diagram below (Fig. 3). The entire experiment will last
for approximately 1-1.5 hours for each participant.
Our primary outcomes include a behavioral measure of writing performance and a neuroscience-based measure of cognitive effort, assessed through a psychophysiological proxy. We opt to base our sample size estimation on writing performance rather than cognitive effort for two reasons. First, the effect of generative AI on performance outcomes has been studied more than its effect on cognitive effort. Recent empirical evidence suggests that the effect size of generative AI on writing tasks is around Cohen’s d = 0.4-0.5 [1, 49]. However, we did not find prior evidence on the effect size of
generative AI on cognitive effort using physiological measures. Second, our physiological measure of
cognitive effort is likely to be sufficiently powered once the sample size satisfies our behavioral measure
of writing performance. While writing performance is measured on a zero- to six-point scale after the
writing task is completed, the cognitive effort is repeatedly measured by time series pupil data
throughout the entire writing process lasting for approximately 30 minutes. Repeated outcome
measures generally can enhance statistical power by leveraging within-subject variability, though the
potential autocorrelation among the repeated measures needs to be appropriately accounted for in the
analysis [50]. Pupillometry studies on cognitive effort, such as N-back tasks, typically recruit 20-50 participants with short, repeated, within-subject trials (e.g., [51]). Although our study differs from common pupillometry designs because we measure the time dynamics of pupil size changes in a single, long trial for each participant, these studies still provide a general estimate of the number of participants needed for pupillometry studies with repeated outcome measures.
To estimate the required sample size, we conducted a simulation analysis on the intervention effect
using ordinary least squares (OLS) regression. The simulation assumes normally distributed outcomes, equal and standardized standard deviations across the two conditions, and an anticipated effect size of Cohen’s d = 0.45. Our analysis indicated that recruiting a minimum of 160 participants
would be necessary to achieve a statistical power greater than 0.8 under an alpha level of 0.05. The
simulation was implemented in R, and the corresponding code is available at the Open Science
Framework (OSF) via https://osf.io/9jgme/?view_only=1eeba901f67546bd954e53ddec330231.
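As a rough illustration of this approach (not the registered OSF code), a power simulation along these lines can be written in a few lines of R. The settings below mirror the assumptions stated above; the function name and the number of simulation runs are ours.

```r
# Simulation-based power estimate for a two-arm trial analyzed with OLS.
# Assumptions: Normal(0, 1) outcomes in the control arm, Normal(0.45, 1) in the
# intervention arm (Cohen's d = 0.45), 1:1 allocation, two-sided alpha = 0.05.
set.seed(42)

simulate_power <- function(n_total, d = 0.45, n_sims = 5000, alpha = 0.05) {
  n_arm <- n_total / 2
  p_values <- replicate(n_sims, {
    group <- rep(c(0, 1), each = n_arm)
    y <- rnorm(n_total, mean = d * group, sd = 1)
    # p-value of the treatment coefficient from the OLS regression
    summary(lm(y ~ group))$coefficients["group", "Pr(>|t|)"]
  })
  mean(p_values < alpha)  # share of significant replications = estimated power
}

simulate_power(160)  # returns a value slightly above 0.80 for d = 0.45
```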
Recruitment {15}
The recruitment will follow a convenience sampling strategy. To aim for a student population with
diverse academic backgrounds, participants will be recruited broadly through social media platforms,
email lists, and flyers at the research university where the experiment will be conducted. Because the experiment will start during the summer, the research team can also recruit summer school students, so the study sample will not be limited to students regularly enrolled at the university. The
recruitment materials include a brief description of the study, the eligibility criteria for participation, and
the compensation for participating. Interested individuals can sign up on a calendar by selecting
available time slots provided by the experimenters.
Sequence generation {16a}
The sequence will be generated using computer-generated random numbers to assign participants in a 1:1 ratio to the intervention group or the control group. The randomization process will be independent of the recruitment and implementation process. Only participants who fulfill the eligibility criteria and give consent to participate in the study will be allocated according to the randomized sequence.
Allocation concealment mechanism {16b}
Not applicable. The randomization procedures are covered in Sections 16a, 16c, and 17a.
Implementation {16c}
Randomization is generated in advance using an R script that allocates participant IDs into either the
intervention group or the control group. The randomization algorithm is independent of the researchers
who will recruit participants and implement the protocol.
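A minimal sketch of such a pre-generated allocation script is shown below; the seed, ID format, and output file name are our assumptions, and the actual script is part of the study materials.

```r
# Pre-generated 1:1 allocation of participant IDs (illustrative sketch).
set.seed(2024)  # fixed seed so the sequence is reproducible

n <- 160
ids <- sprintf("P%03d", 1:n)
# Random permutation with exactly 80 participants per arm
allocation <- sample(rep(c("intervention", "control"), each = n / 2))
write.csv(data.frame(id = ids, group = allocation),
          "allocation_sequence.csv", row.names = FALSE)
```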
Who will be blinded {17a}
Participants will be blinded in the experiment. Experimenters will not be blinded because they need to set the computer screen to the appropriate format depending on whether the participant is assigned to the intervention group or the control group. The analyst will be blinded: the assigned condition will be masked during statistical analysis to minimize potential bias.
Procedure for unblinding if needed {17b}
Unblinding is not permissible for the participants as it may cause social desirability bias. The
participant’s data will be excluded if the assigned condition is accidentally revealed and will not be
counted in the randomized sequence.
For each participant, the study will take no more than 1.5 hours. The participant may withdraw from the study at any time; their compensation will be prorated based on the time spent in the experiment. There is no follow-up study.
All data collection during the experiment will be anonymous. The experimental data collection process
will be separated from the collection process for the personally identifiable data required for scheduling
and consent purposes. Pseudonymized IDs will be used to join all data sources. Survey data will be
collected on the Qualtrics platform. Except for Qualtrics, third parties will not have access to these data.
Interaction data with ChatGPT will be collected on the ChatGPT platform under the study team’s account,
exported, and removed after each participant completes their session. All other data (e.g., eye-tracking
data, fNIRS data, writing data in the text editor) will be locally collected on the computer in the
experiment room. All data will be uploaded to the university-owned, encrypted cloud storage service.
Only the study team will have access to the data.
Confidentiality {27}
Data collected on the survey, ChatGPT, and the local computer will be joined using pseudonymized IDs
for data analysis. Participant’s personal information (i.e., name, contact information, and experiment
time slot) will be collected separately only for contacting and scheduling purposes and in case the
participant would like to withdraw their data from the study. The principal investigator and research staff
who conduct the experiment will have access to this information and have been trained before the study
to ensure that they understand the rules for confidentiality and data protection. No other researchers on
the team will have access to the data on personal information. No data will be captured on
paper/physical media other than the signed consent form and the compensation confirmation form. The
two forms will be stored in a locked cabinet in the experiment room.
Plans for collection, laboratory evaluation and storage of biological specimens for genetic or molecular analysis in this trial/future use {33}
Not applicable.
Statistical methods
Our first primary outcome, writing performance, will be treated as a continuous variable. We will use an
ordinary least squares (OLS) regression model with robust (i.e., heteroskedasticity-consistent) standard
errors to estimate the intervention effect on writing performance. Our second primary outcome,
cognitive effort as measured by changes in pupil size, will also be treated as a continuous variable. This
outcome is a psychophysiological measure recorded throughout the entire writing process. Standard
pre-processing steps will be taken before the statistical analysis. This includes removing artifacts such
as blinks, interpolating invalid data, downsampling, and correcting pupil size changes based on the
baseline pupil size that will be collected during a 30-second relaxation task at the beginning of the
experiment. Participant-level data will be excluded if the data do not meet the quality standards (e.g.,
having a high proportion of invalid data). After pre-processing, we will use the time series pupil data to
estimate the intervention effect on cognitive effort by running a linear mixed model with participant-level
random effects. Autocorrelation in the time series data will be assessed using the Durbin-Watson (DW)
test. We will determine a time window where the DW test value falls between 1.5 and 2.5 to mitigate
potential underestimation of standard errors due to autocorrelation. We will also conduct a robustness
check to assess whether varying the time window affects the findings. Time taken to complete the
writing task will be added to the model as a covariate to control for fatigue due to time length. Here, the
intervention effect is estimated by an intent-to-treat analysis.
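The sketch below shows what these primary analyses could look like in R. The data frames and variable names (`scores`, `pupil`, `score`, `group`, `pupil_change`, `task_minutes`, `participant_id`) are hypothetical, the HC3 estimator is one common choice of robust standard error, and the pooled Durbin-Watson test is a simplified stand-in for the window-selection procedure described above.

```r
library(lmtest)    # coeftest(), dwtest()
library(sandwich)  # vcovHC() heteroskedasticity-consistent standard errors
library(lme4)      # lmer() linear mixed models

# Primary outcome 1: writing performance, OLS with robust standard errors
fit_score <- lm(score ~ group, data = scores)
coeftest(fit_score, vcov. = vcovHC(fit_score, type = "HC3"))

# Primary outcome 2: baseline-corrected pupil size, linear mixed model with
# participant-level random intercepts; task duration controls for fatigue
fit_pupil <- lmer(pupil_change ~ group + task_minutes + (1 | participant_id),
                  data = pupil)
summary(fit_pupil)

# Autocorrelation check: the aggregation window would be adjusted until the
# Durbin-Watson statistic falls within [1.5, 2.5]
dwtest(lm(pupil_change ~ group + task_minutes, data = pupil))
```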
For the secondary outcomes, all survey scale measures will be treated similarly to the writing
performance outcome. They will be viewed as continuous variables and analyzed using linear
regression. Cognitive effort as measured by changes in the cortical hemodynamics will be treated
similarly to pupil size changes. Through a series of pre-processing steps, we will compute channel-wise
hemodynamic changes in oxygenated hemoglobin (HbO) and deoxygenated hemoglobin (HbR) using the
Satori software. First, the raw data of two wavelengths (760 nm and 850 nm) will be trimmed so only the
fNIRS data collected during the writing process is analyzed. Any channels with a coefficient of variation
(CV) over 10% will be rejected [52]. Second, the raw intensity data will be converted to optical density (OD). Third, channel quality will be evaluated using the scalp coupling index (SCI) [54]: if the correlation between the two OD wavelengths of a channel is below 0.5, the channel will be rejected. Fourth, after bad-channel rejection, the OD data will be converted to changes-of-concentration (CC) data using the modified Beer-Lambert law (MBLL) [53]. Fifth, spikes in the CC data will be detected using a robust spike detection method (10 iterations, 4-second lag, a threshold of 3.5 standard deviations, an influence of 0.3, and monotonic interpolation). To further remove motion artifacts, the correlation-based signal improvement (CBSI) algorithm [55] and temporal derivative distribution repair (TDDR) [56] will be used. Sixth, physiological noise will be removed using the data from the eight short-distance channels through principal component analysis (PCA): the first two principal components will be used as regressors to remove
physiological noise from the CC data. Additional temporal filtering methods including linear detrending,
high-pass Butterworth filter (> 0.01 Hz), and low-pass Gaussian smoothing filter (< 0.2 Hz) will be used to
further remove different sources of physiological noise and systemic non-brain activity-related signals.
Finally, the signal in each channel will be mean-centered and scaled by the standard deviation of its fluctuations (z-normalization). After pre-processing, we will use a generalized linear mixed model with both participant-level and channel-level random effects to estimate the intervention effect.
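As an illustration, the channel-level model could be specified as below (shown with lme4 and the identity link; the long-format data frame `fnirs` and its column names are our assumptions, with one row per participant, channel, and chromophore after pre-processing).

```r
library(lme4)

# Crossed random intercepts for participants and channels; HbO and HbR are
# modeled separately on the z-normalized signals
fit_hbo <- lmer(signal_z ~ group + (1 | participant_id) + (1 | channel),
                data = subset(fnirs, chromophore == "HbO"))
summary(fit_hbo)
```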
For the above statistical modeling, we will first run the analyses without adjusting for covariates because randomization eliminates confounding on average. Subsequently, we will run the analyses with covariate adjustments, including participants’ self-reported skill levels and motivation, because adjustment can often improve power and reduce residual confounding. For skill
levels, we will include three variables based on three aspects of self-reported skill level: writing ability,
critical thinking ability, and English language ability. For motivation, we will include one variable based on
participants’ self-reported motivation to achieve a high-performance score in the analytical writing task.
The four variables will be viewed as continuous variables and all added to the model. Should there be
variations in other baseline measures, such as gender and race, between the intervention group and the
control group, we will further adjust our model to control for these potential confounding sources.
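In R, the adjusted specification would simply extend the unadjusted model; the covariate names below are hypothetical placeholders for the four self-reported measures described above.

```r
# Covariate-adjusted model for writing performance (illustrative)
fit_adjusted <- lm(score ~ group + writing_skill + critical_thinking_skill +
                     english_skill + motivation, data = scores)
summary(fit_adjusted)
```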
As a robustness check, we will view the pupil size measure during the entire writing task as a repeated
measure and use a linear mixed model to estimate the intervention effect. Specifically, we will divide the
writing process into small time windows (e.g., 30 seconds) and compute the pupil size average for each
window. The linear mixed model will account for the repeated measures over time and participant-level
random effects.
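A sketch of this windowed robustness check is shown below, assuming a hypothetical `time_s` column (seconds since task onset) in the sample-level pupil data; the 30-second width follows the example in the text, and the linear window term is one simple way to model time.

```r
library(lme4)

pupil$window <- floor(pupil$time_s / 30)  # index of the 30-second window

# Average pupil size per participant per window
windowed <- aggregate(pupil_change ~ participant_id + group + window,
                      data = pupil, FUN = mean)

# Repeated measures over windows with participant-level random intercepts
fit_windowed <- lmer(pupil_change ~ group + window + (1 | participant_id),
                     data = windowed)
summary(fit_windowed)
```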
Additionally, we will conduct per-protocol analyses of the intervention effect. We will conduct subgroup analyses to examine heterogeneous treatment effects, provided that sufficient sample sizes can be
recruited in each group. The variables of interest are prior skill levels in writing ability, critical thinking
ability, and English language ability, as well as motivation. Each of these variables will be examined
independently.
Methods in analysis to handle protocol non-adherence and any statistical methods to handle missing
data {20c}
Plans to give access to the full protocol, participant level-data and statistical code {31c}
This document is the full protocol. Additional material can be accessed via OSF: https://osf.io/9jgme/?view_only=1eeba901f67546bd954e53ddec330231.
The coordinating center will be based at the Heidelberg Institute for Global Health (HIGH). The day-to-
day experiment coordination will be managed by the study team at the Core Facility for Neuroscience of
Self-Regulation (CNSR). The principal investigator will provide oversight of the study. The data manager
will be responsible for organizing data collection and ensuring the integrity and quality of the data. The
study coordinator will oversee participant recruitment, study visits, and weekly feedback reports. There is
no trial steering committee or stakeholder and public involvement group.
Composition of the data monitoring committee, its role and reporting structure {21a}
The study will not include a data monitoring committee separate from the study team because there will be no interim data analyses. The study team is independent of the trial sponsor and has no competing interests.
This trial is a lab experiment that asks participants to complete a writing task. It is very unlikely to cause
adverse events.
Not applicable. This study is a small-scale lab experiment that does not require external auditing.
Plans for communicating important protocol amendments to relevant parties (e.g. trial participants,
ethical committees) {25}
Substantial amendments will be reported to the Ethics Committee of the Heidelberg Medical Faculty. Non-substantial amendments will be documented and updated in the online trial registries. Additional documents will be uploaded to the OSF.
The results of this study will be disseminated through presentations at international conferences and
publications in peer-reviewed journals.
Discussion
Since the public release of ChatGPT in 2022, there have been heated discussions on the societal
implications of generative AI. Concerns and promises have both been raised about its potential effect on
human cognition when such tools are widely integrated into daily tasks [6, 15]. In this study, we propose
to evaluate the effects of generative AI use on human cognition and task performance, in the context of
a hypothetical analytical writing assignment undertaken by college students.
The main innovation of our study is using multi-modal data to evaluate the effects of generative AI. We
will collect psychophysiological data throughout the writing process using state-of-the-art neuroscience
technologies. Specifically, we will use the Tobii Pro Fusion eye tracker to capture pupil size changes and
gaze patterns. We will use the NIRSport 2 fNIRS system to measure brain activity. These data will then be
combined and analyzed with behavioral data and self-reported attitudinal data collected in the pre- and
post-surveys. The multi-modality of the data provides a few advantages. First, collecting data from
different modalities will give us a more comprehensive understanding of the effects of generative AI. For
example, combining psychophysiological measures with self-reported measures can provide insights
into both the internal cognitive processes and observable behaviors of participants. Second, multi-modal
data can allow us to validate findings from different data sources. Third, the real-time measures
captured in this study reflect dynamic changes as tasks are performed. These data will provide deeper
insights into how cognitive processes evolve during different phases of a task.
Our study design ensures high internal validity due to the controlled lab setting. However, this approach
has limitations in generalizability. The recruitment process relies on a convenience sampling strategy, as
the experiment requires equipment located at the university’s research lab. As a result, participants may
represent a WEIRD (Western, Educated, Industrialized, Rich, Democratic) population; as a consequence,
results may not represent the impact of generative AI use in a broader, more diverse population.
Moreover, participants in the experiment may not behave as they would in real-world settings, mainly
because they may be motivated to work on the task in other ways depending on the context. In our study,
we carefully control the motivational context for the writing task. We will measure participants’ general
motivation to achieve a high score before they start working on the writing task and will account for this
variation in our regression models. We will control for external incentives by framing the experiment
setting as a hypothetical writing class and by informing participants that they will receive their
performance scores, the class average, and the instructor’s feedback after completing the task. This
design aims to reflect real-life motivational settings for completing homework assignments. Unlike in
other experimental research evaluating the effects of generative AI (e.g., [1, 4]), we opt not to incentivize
better performance monetarily, as this is not suitable for our study context.
Trial status
This trial is currently recruiting participants. Recruitment and all data collection will be completed by the end of February 2025.
Abbreviations
AI: Artificial Intelligence
DW: Durbin-Watson
HbO: Oxyhaemoglobin
HbR: Deoxyhaemoglobin
Declarations
Acknowledgements
Not applicable.
Authors’ contributions
YC, SC, and TB conceived the trial. YC, YW, TW, RK, SC, and TB developed the study design. YC, YF, YL, BL, MY, JZ, and AZ acquired and analyzed the data. SC and TB obtained the funding. All authors provided critical revisions to the manuscript.
Funding {4}
Availability of data and materials {29}
The final trial data are deidentified and will be stored on a university-owned, encrypted cloud storage service. The study investigators own and have complete control over the research data.
Ethics approval and consent to participate {24}
Ethics approval was obtained from the Ethics Committee of the Heidelberg Medical Faculty in Germany (ID: S-117/2024). Participants must review an information sheet and sign a consent form before they can
begin the experiment. The information sheet explains the study’s aim, procedures, potential risks and
benefits, compensation, and contact information for the study investigators. The experimenter will
answer any questions that the participant may have before asking for consent. If the participant meets
the inclusion criteria and agrees to participate, they will be asked to sign the consent form, which the
experimenter will countersign. The participant will receive the information sheet and a copy of the
consent form. The other copy of the consent form is retained by the research team. All participants will
be verbally informed that they can withdraw from the study at any time without giving any reason and
without having any negative consequences to their academic studies. Protocol amendments will be
promptly submitted to the ethics committee.
Not applicable.
Authors’ information (optional)
References
1. Noy S, Zhang W (2023) Experimental evidence on the productivity effects of generative artificial
intelligence. Science 381:187–192
2. Brynjolfsson E, Li D, Raymond LR (2023) Generative AI at work. National Bureau of Economic
Research
3. Dell’Acqua F, McFowland III E, Mollick ER, Lifshitz-Assaf H, Kellogg K, Rajendran S, Krayer L,
Candelon F, Lakhani KR (2023) Navigating the jagged technological frontier: Field experimental
evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School
Technology & Operations Mgt. Unit Working Paper
4. Doshi AR, Hauser OP (2024) Generative AI enhances individual creativity but reduces the collective
diversity of novel content. Sci Adv 10:eadn5290
5. Lee BC, Chung J (2024) An empirical investigation of the impact of ChatGPT on creativity. Nat Hum
Behav 1–9
6. Heersmink R (2024) Use of large language models might affect our cognitive skills. Nat Hum Behav
8:805–806
7. Yan L, Greiff S, Teuber Z, Gašević D (2024) Promises and challenges of generative artificial
intelligence for human learning. Nat Hum Behav 8:1839–1850
8. Dergaa I, Ben Saad H, Glenn JM, Amamou B, Ben Aissa M, Guelmami N, Fekih-Romdhane F, Chamari
K (2024) From tools to threats: a reflection on the impact of artificial-intelligence chatbots on
cognitive health. Front Psychol 15:1259845
9. Sparrow B, Liu J, Wegner DM (2011) Google effects on memory: Cognitive consequences of having
information at our fingertips. Science 333:776–778
10. Montag C, Markett S (2023) Social media use and everyday cognitive failure: investigating the fear
of missing out and social networks use disorder relationship. BMC Psychiatry 23:872
11. Shors TJ, Anderson ML, Curlik II DM, Nokia MS (2012) Use it or lose it: how neurogenesis keeps the brain fit for learning. Behavioural Brain Research 227:450–458
12. Birkel L (2017) Decreased use of spatial pattern separation in contemporary lifestyles may
contribute to hippocampal atrophy and diminish mental health. Med Hypotheses 107:55–63
13. Clark A, Chalmers DJ (2010) The extended mind.
14. Risko EF, Gilbert SJ (2016) Cognitive offloading. Trends Cogn Sci 20:676–688
15. Chiriatti M, Ganapini M, Panai E, Ubiali M, Riva G (2024) The case for human–AI interaction as
system 0 thinking. Nat Hum Behav 8:1829–1830
16. Carr N (2020) The shallows: What the Internet is doing to our brains. WW Norton & Company
17. Siemens G, Marmolejo-Ramos F, Gabriel F, Medeiros K, Marrone R, Joksimovic S, de Laat M (2022)
Human and artificial cognition. Computers and Education: Artificial Intelligence 3:100107
18. Sun L, Zhou L (2024) Does Generative Artificial Intelligence Improve the Academic Achievement of
College Students? A Meta-Analysis. Journal of Educational Computing Research.
https://doi.org/10.1177/07356331241277937
19. Bastani H, Bastani O, Sungu A, Ge H, Kabakcı O, Mariman R (2024) Generative AI can harm learning. Available at SSRN 4895486
20. Darvishi A, Khosravi H, Sadiq S, Gašević D, Siemens G (2024) Impact of AI assistance on student
agency. Comput Educ 210:104967
21. Lehmann M, Cornelius PB, Sting FJ (2024) AI Meets the Classroom: When Does ChatGPT Harm
Learning? arXiv preprint arXiv:2409.09047
22. Mislevy RJ, Behrens JT, Dicerbo KE, Levy R (2012) Design and discovery in educational assessment:
Evidence-centered design, psychometrics, and educational data mining. Journal of educational data
mining 4:11–48
23. Lodge JM (2018) A Futures Perspective on Information Technology and Assessment. In: Voogt J,
Knezek G, Christensen R, Lai K-W (eds) Second Handbook of Information Technology in Primary and
Secondary Education. Springer International Publishing, Cham, pp 1–13
24. Lund K (2011) Analytical frameworks for group interactions in CSCL systems. Analyzing Interactions
in CSCL: Methods, Approaches and Issues 391–411
25. Kizilcec RF, Pérez-Sanagustín M, Maldonado JJ (2017) Self-regulated learning strategies predict
learner behavior and goal attainment in Massive Open Online Courses. Comput Educ 104:18–33
26. Swiecki Z, Khosravi H, Chen G, Martinez-Maldonado R, Lodge JM, Milligan S, Selwyn N, Gašević D
(2022) Assessment in the age of artificial intelligence. Computers and Education: Artificial
Intelligence 3:100075
27. GRE General Test Analytical Writing Overview.
28. Liu OL, Frankel L, Roohr KC (2014) Assessing critical thinking in higher education: Current state and
directions for next‐generation assessment. ETS Research Report Series 2014:1–23
29. Dwyer CP, Hogan MJ, Stewart I (2014) An integrated critical thinking framework for the 21st century.
Think Skills Creat 12:43–52
30. Halpern DF (1998) Teaching critical thinking for transfer across domains: Disposition, skills,
structure training, and metacognitive monitoring. American psychologist 53:449
31. Breyer FJ, Attali Y, Williamson DM, Ridolfi‐McCulla L, Ramineni C, Duchnowski M, Harris A (2014) A
study of the use of the e‐rater® scoring engine for the analytical writing measure of the GRE®
revised General Test. ETS Research Report Series 2014:1–66
32. Meyer J, Jansen T, Schiller R, Liebenow LW, Steinbach M, Horbach A, Fleckenstein J (2024) Using
LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases
secondary students’ text revision, motivation, and positive emotions. Computers and Education:
Artificial Intelligence 6:100199
33. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S,
Anadkat S (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774
34. Laeng B, Alnaes D (2019) Pupillometry. Eye movement research: An introduction to its scientific
foundations and applications 449–502
35. Van der Wel P, Van Steenbergen H (2018) Pupil dilation as an index of effort in cognitive control
tasks: A review. Psychon Bull Rev 25:2005–2015
36. Friedman NP, Robbins TW (2022) The role of prefrontal cortex in cognitive control and executive
function. Neuropsychopharmacology 47:72–89
37. Yuan P, Raz N (2014) Prefrontal cortex and executive functions in healthy adults: a meta-analysis of
structural neuroimaging studies. Neurosci Biobehav Rev 42:180–192
38. Kim H (2019) Neural activity during working memory encoding, maintenance, and retrieval: A
network‐based model and meta‐analysis. Hum Brain Mapp 40:4912–4933
39. Rottschy C, Langner R, Dogan I, Reetz K, Laird AR, Schulz JB, Fox PT, Eickhoff SB (2012) Modelling
neural correlates of working memory: a coordinate-based meta-analysis. Neuroimage 60:830–846
40. Storkerson P (2010) Naturalistic Cognition: A Research Paradigm for Human-Centered Design. J Res
Pract 6:M12
41. Virk T, Letendre T, Pathman T (2024) The convergence of naturalistic paradigms and cognitive
neuroscience methods to investigate memory and its development. Neuropsychologia 196:108779
42. Hart SG (2006) NASA-task load index (NASA-TLX); 20 years later. In: Proceedings of the human
factors and ergonomics society annual meeting. Sage Publications, Los Angeles, CA, pp
904–908
43. Hart SG, Staveland LE (1988) Development of NASA-TLX (Task Load Index): Results of Empirical
and Theoretical Research. In: Hancock PA, Meshkati N (eds) Advances in Psychology. North-
Holland, pp 139–183
44. Gaab J (2009) PASA–primary appraisal secondary appraisal. Verhaltenstherapie 19:114–115
45. Pollak A, Paliga M, Pulopulos MM, Kozusznik B, Kozusznik MW (2020) Stress in manual and
autonomous modes of collaboration with a cobot. Comput Human Behav 112:106469
46. Bruning R, Dempsey M, Kauffman DF, McKim C, Zumbrunn S (2013) Examining dimensions of self-
efficacy for writing. J Educ Psychol 105:25
47. Hulleman CS, Godes O, Hendricks BL, Harackiewicz JM (2010) Enhancing interest and performance
with a utility value intervention. J Educ Psychol 102:880
48. Albayati H (2024) Investigating undergraduate students’ perceptions and awareness of using
ChatGPT as a regular assistance tool: A user acceptance perspective study. Computers and
Education: Artificial Intelligence 6:100203
49. Dhillon PS, Molaei S, Li J, Golub M, Zheng S, Robert LP (2024) Shaping Human-AI Collaboration:
Varied Scaffolding Levels in Co-writing with Language Models. In: Proceedings of the CHI
Conference on Human Factors in Computing Systems. pp 1–18
50. Zhang F, Wagner AK, Ross-Degnan D (2011) Simulation-based power calculation for designing
interrupted time series analyses of health policy interventions. J Clin Epidemiol 64:1252–1261
51. Yeung MK, Lee TL, Han YMY, Chan AS (2021) Prefrontal activation and pupil dilation during n-back
task performance: A combined fNIRS and pupillometry study. Neuropsychologia 159:107954
52. Zimeo Morais GA, Scholkmann F, Balardin JB, Furucho RA, de Paula RCV, Biazoli Jr CE, Sato JR
(2018) Non-neuronal evoked and spontaneous hemodynamic changes in the anterior temporal
region of the human head may lead to misinterpretations of functional near-infrared spectroscopy
signals. Neurophotonics 5:11002
53. Scholkmann F, Wolf M (2013) General equation for the differential pathlength factor of the frontal
human head depending on wavelength and age. J Biomed Opt 18:105004
54. Pollonini L, Olds C, Abaya H, Bortfeld H, Beauchamp MS, Oghalai JS (2014) Auditory cortex
activation to natural speech and simulated cochlear implant speech measured with functional near-
infrared spectroscopy. Hear Res 309:84–93
55. Cui X, Bray S, Reiss AL (2010) Functional near infrared spectroscopy (NIRS) signal improvement
based on negative correlation between oxygenated and deoxygenated hemoglobin dynamics.
Neuroimage 49:3039–3046
56. Fishburn FA, Ludlum RS, Vaidya CJ, Medvedev AV (2019) Temporal derivative distribution repair
(TDDR): a motion correction method for fNIRS. Neuroimage 184:171–179
Figures
Figure 1
The trial design with participants randomized into the intervention group and the control group in a 1:1
ratio
Figure 2
The predefined fNIRS montage (eight sources, eight detectors, and eight short-distance detectors)
Figure 3
Schematic diagram of the experiment time schedule