1 Introduction
In education, writing is a prevalent pedagogical practice employed by educators to facilitate students’ learning [43]. It can benefit students in various ways, such as developing analytical thinking [28], identifying knowledge gaps [26], and enhancing communication skills [41]. In a well-known cognitive process model of writing [26], writing is understood as a complex task that requires the coordination of three core cognitive processes: Planning, Translating, and Reviewing. During the Planning phase, writers engage in generating ideas, setting goals, and organizing their thoughts. This stage involves deciding what message the writer wants to communicate, the goals they hope to achieve, and the structure they will use to present their ideas logically. The Translating phase follows, where writers take the ideas formed in the planning stage and convert them into written language. This process involves selecting appropriate vocabulary, constructing grammatically correct sentences, and ensuring the text is coherent and flows smoothly. Finally, in the Reviewing phase, writers critically assess their draft by reading through the text to evaluate its overall quality. This stage involves identifying potential problems, such as unclear wording, weak arguments, or grammatical errors, and making necessary revisions. Writers may adjust content, refine style, or improve accuracy to enhance the effectiveness of their writing. In education, students are expected to actively engage in all three stages to achieve meaningful learning via writing tasks [26].
With the advent and increasing popularity of Generative AI (GAI), more and more higher education institutions have embraced GAI to support their teaching and learning, and GAI-assisted writing is now prevalent among students in higher education [34]. In GAI-assisted writing, students can delegate some of the core rhetorical and cognitive workload to GAI and receive help with content creation, creativity, and efficiency [36]. However, the use of GAI in writing introduces challenges for educators in supporting and assessing students’ learning, as tasks that were originally undertaken by students to achieve meaningful learning are now performed by GAI. Specifically, students are traditionally responsible for independently planning, translating, and revising their work. With GAI support, students can now use GAI to generate ideas, convert those ideas into written text, and even request GAI to directly revise their written work. Unlike traditional writing, where the final product reflects the student’s individual effort, GAI-assisted writing produces a combination of student-created and AI-generated content. Consequently, traditional writing assessment methods, which often focus on evaluating the quality of students’ written products, might not be suitable in GAI-assisted contexts. Therefore, we argue it is crucial to understand the relationship between students’ different GAI-assisted writing behaviors and the quality of the written essay. This will help determine whether improvements in essay quality are primarily driven by the capabilities of GAI or by meaningful learning undertaken by students during the writing process. In addition, it will guide educators in judging whether more advanced assessment methods (e.g., those considering students’ in-process writing logs [45]) are in even greater demand in this new context. Several studies have been conducted to assess whether essay quality improves when essays are written with GAI assistance [23, 47]. However, these studies were often limited in that they only investigated the overall impact of using GAI to support writing by comparing two groups, one with GAI-powered writing assistance and the other without, shedding little light on how different GAI-assisted writing behaviors influence writing quality. Such a binary comparison (i.e., with or without GAI) fails to account for differences in how students interact with GAI, which can affect both the writing process and its outcomes. Moreover, it does not clarify whether meaningful learning occurs in the GAI-assisted context, as high-quality essays may simply result from GAI doing much of the work.
In GAI-assisted writing, it is common for students to directly incorporate GAI-generated text into their writing [44]. However, previous research has demonstrated that Large Language Models (LLMs, a main subset of GAI) can carry and amplify significant biases (e.g., gender, racial, and socioeconomic) [38]. These biases arise from the training datasets, which often contain historical and societal inequities, reflecting real-world disparities in the representation of different genders, racial groups, and cultures [2]. This raises concerns within the educational community, as linguistic biases embedded in LLMs, such as the use of terms like "female nurse" or "male doctor," may lead to biased outputs, which could be incorporated into students’ work. A recent study underscored the risk of LLMs perpetuating harmful stereotypes through automated writing tools, potentially influencing students’ writing styles and perspectives unintentionally [61]. Therefore, in addition to traditional measures of writing quality such as coherence, syntax, and vocabulary [33], it is equally important to take linguistic bias into account when measuring writing quality and to investigate its relationship with different GAI-assisted writing behaviors.
To address these gaps, this study aimed to answer the following Research Question (RQ): What writing behaviors contribute to the quality of written products in the setting of GAI-assisted writing? In this study, we chose a publicly available dataset, CoAuthor [44], containing 1,445 GAI-assisted writing sessions, including both final outputs and log trace events from the GAI-assisted writing process. We focused on three types of GAI-assisted writing behavioral patterns that may indicate meaningful learning or its absence, namely only seeking suggestions from GAI but not accepting them, seeking suggestions from GAI and accepting them as they are, and seeking suggestions from GAI and accepting them with modification. To assess the quality of the written products, we selected three measures frequently used in previous writing research, namely lexical sophistication, syntactic complexity, and text cohesion [18]. Additionally, we incorporated gender bias (e.g., "She was too emotional to make a rational decision.") as a representative of linguistic biases to further evaluate the quality of the written products.
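To make these outcome variables more concrete, the toy proxies below illustrate, in rough terms, what each measure captures. These are illustrative assumptions only, not the validated indices and bias measure used in the study (see Section 3); the word lists, thresholds, and function names are placeholders.

```python
# Toy proxies (assumptions for illustration only) for the four quality measures.
import re

FEMALE = {"she", "her", "hers", "woman", "women", "female"}
MALE = {"he", "him", "his", "man", "men", "male"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def lexical_sophistication(text):
    # Proxy: share of long (7+ character) word types; real indices rely on
    # word-frequency norms and n-gram measures.
    types = set(tokenize(text))
    return sum(len(w) >= 7 for w in types) / max(len(types), 1)

def syntactic_complexity(text):
    # Proxy: mean sentence length in words; real indices use clause- and
    # phrase-level measures derived from parses.
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(tokenize(s)) for s in sents) / max(len(sents), 1)

def text_cohesion(text):
    # Proxy: average word overlap between adjacent sentences.
    sents = [set(tokenize(s)) for s in re.split(r"[.!?]+", text) if s.strip()]
    pairs = list(zip(sents, sents[1:]))
    return sum(len(a & b) / max(len(a | b), 1) for a, b in pairs) / max(len(pairs), 1)

def gender_imbalance(text):
    # Proxy: normalized imbalance between female- and male-referencing tokens;
    # the study's gender-bias measure is more fine-grained than this.
    toks = tokenize(text)
    f, m = sum(t in FEMALE for t in toks), sum(t in MALE for t in toks)
    return abs(f - m) / max(f + m, 1)
```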
We aimed to identify causal relationships between GAI-assisted writing behaviors and the resulting quality of the written products to address the RQ, so that such insights can be directly employed to guide real-world pedagogical writing practices. While randomized controlled trials (RCTs) are often considered the gold standard for establishing causal relationships [12], practical constraints such as ethical concerns and implementation challenges [32] make RCTs not always feasible. In our study, the variety of writing behaviors (e.g., accepting or modifying GAI suggestions) and contexts (e.g., different writing topics) makes designing RCTs particularly difficult. Instead, we applied causal modeling, a statistical method designed to uncover and understand cause-and-effect relationships using observational data [25]. Specifically, we treated the GAI-assisted writing behavioral patterns displayed by writers as treatments and the four adopted essay quality measures as outcomes. We then encoded all treatments and outcomes into a Directed Acyclic Graph to model the causal relationships among them while conditioning on factors such as writers’ first-language backgrounds and the genre of a writing task. The causal relationships were then identified using the widely used back-door criterion [48] and estimated with the state-of-the-art X-learner algorithm [39]. Further details can be found in Section 3.
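As a concrete illustration of this estimation step, the following sketch shows how an X-learner could be fitted for a single treatment-outcome pair (e.g., T2 on Y1) with the econml library. It is not the authors' implementation: the synthetic data, column names, and model choices are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code): X-learner ATE/ITE estimation for one
# treatment-outcome pair, conditioning on a set of confounders.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from econml.metalearners import XLearner

rng = np.random.default_rng(0)
n = 1445  # number of writing sessions in CoAuthor

# Hypothetical confounders: writing genre, language background, GPT temperature,
# and GPT frequency penalty (writing topic omitted here for brevity).
X = pd.DataFrame({
    "genre_creative": rng.integers(0, 2, n),       # 1 = creative, 0 = argumentative
    "non_native": rng.integers(0, 2, n),           # 1 = non-native English writer
    "gpt_temperature": rng.uniform(0.2, 0.9, n),
    "gpt_freq_penalty": rng.uniform(0.0, 1.0, n),
})
T = rng.integers(0, 2, n)            # 1 = session dominated by accepting GAI text as-is (T2)
Y = rng.normal(size=n) - 0.08 * T    # stand-in for a lexical sophistication score (Y1)

# X-learner: per-arm outcome models plus a propensity model that weights the
# two imputed treatment-effect estimates.
est = XLearner(
    models=GradientBoostingRegressor(),
    propensity_model=LogisticRegression(max_iter=1000),
)
est.fit(Y, T, X=X)

ite = est.effect(X)                  # individual treatment effects, one per session
print("ATE estimate:", ite.mean())   # average treatment effect over all sessions
```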
Through extensive analysis, this study contributed the following major findings: (i) The direct use of GAI-generated text, which suggests minimal learning by the writer, often reduces the quality of an essay in terms of lexical sophistication, syntactic complexity, and cohesion. However, actively revising GAI-generated text, which involves meaningful learning during the writing process, can significantly improve essay quality across all three measures. (ii) When writing independently, human writers are likely to introduce linguistic bias into their text. By incorporating GAI-generated text, whether through revision or direct use, writers can produce essays of higher quality with reduced linguistic bias. (iii) Non-native English writers often benefit from GAI in improving the lexical sophistication and syntactic complexity of their essays. However, they tend to exhibit a higher degree of linguistic bias when writing primarily on their own or revising GAI-suggested text. (iv) Actively revising GAI-suggested text tends to improve the cohesion of essays more in creative writing, but has a lesser impact in argumentative writing.
4 Results
4.1 Results on Average Treatment Effect
The results of the ATE estimations and the refutation tests for each pair of treatment and outcome are presented in Table 2, which provides insight into the overall causal relationship between treatments and outcomes across the entire population. As shown in Table 2, all the causal models for each treatment-outcome pair successfully passed the three refutation tests (i.e., RCC, Placebo, and DSR), with all p-values exceeding 0.05. This indicates that our causal inference results are both robust and credible.
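For readers unfamiliar with these refutation checks, the sketch below shows how the three tests could be run with the dowhy library. This is only an assumed illustration: a simple back-door linear-regression estimator stands in for the X-learner used in the study, and the data and column names are hypothetical.

```python
# Sketch (assumptions, not the paper's code) of the RCC, Placebo, and DSR
# refutation checks using dowhy's built-in refuters.
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(1)
n = 1445
df = pd.DataFrame({
    "genre_creative": rng.integers(0, 2, n),
    "non_native": rng.integers(0, 2, n),
    "gpt_temperature": rng.uniform(0.2, 0.9, n),
    "t2_accept_as_is": rng.integers(0, 2, n).astype(bool),
})
df["lexical_sophistication"] = rng.normal(size=n) - 0.08 * df["t2_accept_as_is"]

model = CausalModel(
    data=df,
    treatment="t2_accept_as_is",
    outcome="lexical_sophistication",
    common_causes=["genre_creative", "non_native", "gpt_temperature"],
)
estimand = model.identify_effect(proceed_when_unidentifiable=True)  # back-door adjustment
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

# A check passes when the refuted estimate stays close to the original estimate
# (random common cause, data subset) or collapses towards zero (placebo), with
# the reported p-value exceeding 0.05.
for refuter in ["random_common_cause", "placebo_treatment_refuter", "data_subset_refuter"]:
    print(model.refute_estimate(estimand, estimate, method_name=refuter))
```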
T1 (seek suggestions -> not accept) was associated with a reduction in both Y1 (lexical sophistication) by 0.065 and Y2 (syntactic complexity) by 0.236, but it led to a slight increase in Y3 (text cohesion) by 0.001. In comparison, T2 (seek suggestions -> accept without revision) exhibited the most substantial negative impact across all three quality measures, reducing Y1 by 0.086, Y2 by 0.869, and Y3 by 0.007. Interestingly, T3 (seek suggestions -> first accept and then revise) improved all three quality measures, with increases in Y1 by 0.102, Y2 by 0.963, and Y3 by 0.008. These results suggest that writers who frequently rely on GAI to write for them exhibit the poorest performance in terms of writing quality. In contrast, writers who use GAI primarily for ideation but compose their own text perform better, particularly showing a positive impact on text cohesion. Writers who actively revise GAI suggestions achieve the highest writing quality, indicating that the process of critically engaging with and refining GAI-generated suggestions enhances lexical sophistication, syntactic complexity, and text cohesion in the written work. For Y4 (gender bias), all three GAI-assisted writing behaviors resulted in a decrease in gender bias. T1 showed the largest decrease (about 0.163), compared to smaller reductions for T2 (0.012) and T3 (0.013). This appears reasonable, as T1 does not directly incorporate GAI-generated text, while T2 and T3 both involve accepting GAI-generated content, whether modified or not. This aligns with existing research suggesting that LLMs often contain biases. Writers who modify GAI suggestions can critically assess and adjust biased language, making T3 slightly better than directly accepting GAI suggestions (T2).
4.2 Results on Individual Treatment Effect
To explore the effect of treatments on outcomes at the individual level (i.e., ITE), we visualized Beeswarm plots for each treatment, conditioned on different confounders: Figure 2 shows this for T1, Figure 3 for T2, and Figure 4 for T3. To further highlight the impact of treatments on outcomes by confounders, summary results are provided in Table 3. Only clear patterns (i.e., where most dots of the same color accumulate on either the left or right of the y-axis) observed in the Beeswarm plots were included in the summary table. These patterns were empirically selected and verified by three authors. Note that C2, containing 20 different writing topics, did not exhibit clear trends and is therefore omitted.
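As a rough illustration of how such a Beeswarm view can be produced (this is not the paper's plotting code), the sketch below plots synthetic ITE values for one treatment-outcome pair, grouped by one confounder; all values and labels are placeholders.

```python
# Illustrative beeswarm-style plot of ITEs split by a confounder (assumed data).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
plot_df = pd.DataFrame({
    "ite": np.concatenate([rng.normal(0.05, 0.04, 200),     # e.g., non-native writers
                           rng.normal(-0.03, 0.04, 200)]),  # e.g., native writers
    "language_background": ["non-native"] * 200 + ["native"] * 200,
})

ax = sns.swarmplot(data=plot_df, x="ite", y="language_background", size=2)
ax.axvline(0.0, color="grey", linestyle="--")  # dots left/right of zero = negative/positive ITE
ax.set_xlabel("ITE on lexical sophistication (Y1)")
plt.tight_layout()
plt.show()
```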
When analyzing T1’s ITE on outcomes by confounders, as presented in Figure 2, we are essentially examining how seeking external GAI suggestions during the planning stage, without accepting those suggestions, influences the quality of the final product across various confounder groups. For T2’s ITE on outcomes by confounders, shown in Figure 3, we are investigating how directly incorporating AI-generated sentences during the translating stage affects the final product’s quality across different confounder groups. Lastly, in the case of T3’s ITE on outcomes by confounders, displayed in Figure 4, we are exploring how writers’ refining of AI-generated text to better align it with their tone or content during the reviewing stage impacts the quality of the final product across various confounder groups. As demonstrated in Table 3, we observed several noteworthy findings, especially within certain confounder groups, where we even identified trends that contradicted the ATE (i.e., the overall average causal effect across the entire population); these are indicated by arrows with stars in the table.
For Y1 (lexical sophistication), we first observed that seeking GAI suggestions without directly accepting them (T1) has a positive effect on lexical sophistication in argumentative writing, while the effect is negative in creative writing. A possible reason is that GAI often produces more logical and formal text, which aligns with the demands of argumentative writing. Writers may be inspired by GAI suggestions to use more advanced vocabulary, even if they do not directly accept the suggestions. Additionally, we found that non-native English writers tended to benefit from either accessing (T1) or directly adopting (T2) GAI-generated text. This may be because non-native writers often have a smaller vocabulary compared to their native English counterparts, and exposure to or direct use of advanced vocabulary in GAI suggestions may enhance the lexical sophistication of their written products.
For Y2 (syntactic complexity), we first observed that directly adopting GAI-generated text without modification (T2) positively impacts syntactic complexity for non-native English writers. This aligns with our findings on Y1, where non-native writers may tend to use simpler syntactic structures compared to their native counterparts. By incorporating GAI-generated text into their essays, non-native writers can enhance the syntactic complexity of their final products. We also found that in high GPT temperature and frequency penalty settings, writers who frequently seek GAI suggestions without directly accepting them (T1) tend to produce more syntactically complex essays. One possible explanation is that GPT-generated text introduces varied vocabulary and greater creativity in these settings, inspiring writers to produce more complex structures in their own writing. Lastly, we found that writers who prefer to accept and revise GAI suggestions (T3) may produce essays with lower syntactic complexity in low GPT frequency penalty settings. This highlights that even writers who demonstrate meaningful learning by actively revising GAI-generated content may still produce essays with less syntactic complexity in these settings. Therefore, assessing only the final written products in GAI-assisted writing tasks is insufficient. Educators need to incorporate new assessment methods to better support students in such contexts.
For Y3 (text cohesion), we first observed that writers who frequently accept GAI suggestions without further modification (T2) tend to produce more cohesive text than those who actively modify GAI suggestions (T3) in argumentative writing tasks. One possible reason is that GAI can generate logical and formal text that aligns with the requirements of argumentative writing. Consequently, writers exhibiting T2 benefit from incorporating such text directly into their essays, while those exhibiting T3 may reduce the effectiveness of GAI suggestions through their modifications. This underscores our argument that meaningful learning during the writing process may not always result in cohesive essays for certain confounder groups, indicating that assessing only the final written product in GAI-assisted tasks is insufficient. We also found that non-native writers may reduce text cohesion if they frequently seek GAI suggestions without directly accepting them (T1). A potential explanation is that although these writers, who may face more difficulty writing in English, gain inspiration from GAI-generated content, they still struggle to produce cohesive text on their own. Lastly, when the GPT temperature is high, meaning the GAI-generated text becomes more diverse and creative, text cohesion tends to benefit less from active revisions of GAI suggestions. This may be due to the difficulty of revising creative, varied text to improve the overall cohesion of an essay.
For Y4 (gender bias), we first found that non-native English writers who directly use GAI-generated text (T2) tend to produce essays with lower gender bias than essays written primarily on their own (T1) or essays consisting of revised GAI-generated text (T3). This suggests that non-native writers may unintentionally introduce biased language, highlighting the need for additional support to help them avoid biased terms in their writing. We also found that in low GPT frequency penalty settings, the direct use of GAI-generated text (T2) may result in less biased essays. This could be because low frequency penalty settings often lead to less varied vocabulary, potentially reducing the likelihood of generating biased terms.
5 Discussion and Conclusion
In this study, we used causal modeling to identify causal relationships within an observational GAI-assisted writing dataset. We defined three distinct treatments: T1 (seek suggestions -> not accept), T2 (seek suggestions -> accept without revision), and T3 (seek suggestions -> first accept and then revise). Additionally, we identified four outcome measures related to the written outputs: Y1 (lexical sophistication), Y2 (syntactic complexity), Y3 (text cohesion), and Y4 (gender bias). To control for potential confounding variables, we considered five confounders: C1 (writing genre), C2 (writing topic), C3 (language background), C4 (GPT temperature), and C5 (GPT frequency penalty). We applied the state-of-the-art X-learner algorithm to infer causal relationships and analyzed both the ATE and the ITE.
Discussion. The ATE results show that T3 consistently and significantly improves all writing quality measures (Y1, Y2, and Y3), while T2 tends to significantly reduce all of these measures. T1 slightly improves Y3 but reduces both Y1 and Y2. The positive effect of T3 suggests that writers who actively engage with AI-generated content by critically refining GAI suggestions produce writing with more sophisticated vocabulary, more complex sentence structures, and more cohesive content. These findings are consistent with prior research, which suggests that deeper interaction with AI tools fosters critical thinking and creativity in language use [54]. They also support the argument that the reviewing phase is critical for improving text quality and fostering learning, as highlighted in traditional writing research [7, 60]. Interestingly, the negative effects of T1 and T2 suggest that over-reliance on GAI suggestions for generating ideas and text may be detrimental to students’ learning. This could be because GAI reduces the need for writers to brainstorm ideas and construct complex sentences on their own. This reinforces the notion that GAI may hinder active linguistic engagement when writers depend too heavily on it [1]. Based on these findings, we argue that students’ GAI-assisted writing behaviors can influence the quality of written products, with the final product reflecting not only their writing abilities but also how they use GAI. Therefore, educators should focus not only on the written product in GAI-assisted writing but also on students’ engagement with GAI (e.g., whether they simply accept GAI suggestions or thoughtfully revise them) during the writing process and how this engagement impacts writing quality. Furthermore, educators should train students to critically refine AI-generated suggestions throughout the writing process, helping them adapt these suggestions to align with their own voice and ideas.
Implications. The ITE results provide several important implications. Firstly, we argue that linguistic bias in written text has not been thoroughly explored in previous literature, particularly in the context of GAI-assisted writing tasks, and warrants further investigation. Our findings indicate that all three GAI-assisted writing behaviors contribute to a reduction in gender bias, suggesting that linguistic bias is already present in students’ writing, even without GAI assistance. Notably, non-native English writers are more likely to unintentionally amplify gender biases when modifying GAI suggestions or writing independently, likely due to unfamiliarity with nuanced gendered language. Additionally, the reduction in bias is significantly greater in T1 compared to T2 and T3, which is expected, as T1 does not directly incorporate GAI-generated text, whereas T2 and T3 do. This underscores the importance of designing targeted training for educators to address bias in both GAI outputs and students’ writing, especially for non-native writers. Secondly, students with different demographic attributes require tailored support in various writing and GAI settings to achieve meaningful learning through GAI-assisted writing practices. For instance, writers who demonstrated meaningful learning by actively modifying GAI suggestions in low GPT temperature settings often produced essays with lower syntactic complexity. Educators should recognize this and offer targeted guidance to help these writers effectively adjust GAI suggestions, ensuring that their revisions align with their own voice and ideas while enhancing the syntactic complexity of their writing. Similarly, in argumentative writing, students who displayed meaningful learning often produced essays with lower text cohesion. Educators should also provide guidance on how to effectively modify GAI suggestions in argumentative writing tasks to help students create more cohesive essays. Thirdly, there is a need to develop new in-process assessment methods for GAI-assisted writing to better evaluate students’ writing performance and enhance meaningful learning. It is evident that some non-native writers, who showed less meaningful learning by frequently accepting GAI suggestions without modification, still produced high-quality essays. Similarly, in high GPT temperature settings, students who exhibited less meaningful learning also managed to create high-quality essays. This highlights that evaluating only the final written products in GAI-assisted writing tasks is insufficient and may cause some students to miss valuable learning opportunities during the writing process.
Limitations. We acknowledge several limitations in our study. Firstly, our findings are based on the CoAuthor dataset, which may limit their generalizability to other writing contexts with different subjects or requirements. In future work, we aim to gather additional data from more diverse writing settings to further validate our results. Secondly, the limited availability of writer-related information, such as GAI literacy, restricts the inclusion of relevant confounders. To address this, we plan to incorporate surveys or tests in future data collection efforts to better capture these factors. Thirdly, the imbalance in data distribution (e.g., non-native vs. native writers) within the dataset may have influenced the results. To mitigate this, we intend to recruit a more balanced group of writers in future studies.